# Twitter Scraper
A Twitter scraper that uses the Selenium library to collect Twitter data for the DSiRBS Research Project.


<*developed by Jonas-Mika Senghaas (DS 2020)*>
<*27.10.2020*>

---
## To-Do's 
- [ ]  Only scrape tweets dated from the corona outbreak, i.e. from the day of the lockdown in Denmark (2020-03-13)
- [x]  Make the Looping work
- [x]  Adjust the loading times for Logging into Twitter
- [x]  Make the Output File of the type: "scraping: {searchterm}"
- [x]  Let the Program Enter the Login Credentials
---
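
The open to-do item about the lockdown date could be handled by filtering on the timestamp that `get_tweet_data` stores as the third element of each tweet tuple. The following is only a minimal sketch. It assumes that timestamps look like `2020-03-13T18:05:00.000Z` and that "from the outbreak" means on or after 2020-03-13 (UTC). The names `LOCKDOWN_START`, `parse_timestamp` and `is_since_lockdown` are hypothetical and not part of the notebook.

```python
from datetime import datetime, timezone

# Assumption: the lockdown date from the to-do list, taken as midnight UTC.
LOCKDOWN_START = datetime(2020, 3, 13, tzinfo=timezone.utc)


def parse_timestamp(postdate):
    """Parse an ISO 8601 timestamp such as '2020-03-13T18:05:00.000Z'."""
    # Older Python versions do not accept a trailing 'Z' in fromisoformat,
    # so it is replaced by an explicit UTC offset.
    return datetime.fromisoformat(postdate.replace("Z", "+00:00"))


def is_since_lockdown(tweet, start=LOCKDOWN_START):
    """Return True if the tweet's timestamp (tuple index 2) is on or after start."""
    return parse_timestamp(tweet[2]) >= start


# Illustrative tuples in the order returned by get_tweet_data
before = ("Ann", "@ann", "2020-03-12T23:59:59.000Z", "text", "", "", "")
after = ("Bob", "@bob", "2020-03-13T00:00:00.000Z", "text", "", "", "")
kept = [t for t in (before, after) if is_since_lockdown(t)]
```

Such a filter could be applied either inside the scraping loop or to the finished `data` list before saving.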

## Imports + Functions
---
The program needs several imports, each explained in a comment next to it. This cell also defines the central function `get_tweet_data`, which extracts all necessary data from a single tweet. It is called later in the scraping cell.

In [None]:
import csv # used to output the data into a csv file (that we can analyze)
from getpass import getpass # used to type in the twitter password without echoing it
from time import sleep  # used to pause the scraper so pages can load and scroll, and to avoid being blocked for sending requests too quickly

# selenium library to navigate browser and scrape twitter
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver import Chrome # we use the chrome browser for the twitter webscraping 


def get_tweet_data(card):
    """
    Extract data from tweet card using xpath searches for each important element of the tweet
    """
    username = card.find_element_by_xpath('.//span').text
    try: # promoted tweets have no handle and no post date, so we filter them out via error handling (the function then returns None)
        handle = card.find_element_by_xpath('.//span[contains(text(), "@")]').text
    except NoSuchElementException:
        return
    
    try:
        postdate = card.find_element_by_xpath('.//time').get_attribute('datetime')
    except NoSuchElementException:
        return

    comment = card.find_element_by_xpath('.//div[2]/div[2]/div[1]').text
    responding = card.find_element_by_xpath('.//div[2]/div[2]/div[2]').text
    text = comment + responding
    reply_cnt = card.find_element_by_xpath('.//div[@data-testid="reply"]').text
    retweet_cnt = card.find_element_by_xpath('.//div[@data-testid="retweet"]').text
    like_cnt = card.find_element_by_xpath('.//div[@data-testid="like"]').text

    # collect all the scraped data in a tuple and return it
    tweet = (username, handle, postdate, text, reply_cnt, retweet_cnt, like_cnt)
    
    return tweet
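
The reply, retweet and like counters are stored exactly as the text shown on the page. For later analysis it may help to convert them to integers. The following is a minimal sketch under stated assumptions: an empty counter is taken to mean zero, and abbreviated values are assumed to use `K` and `M` suffixes (for example `1.2K`). The helper `parse_count` is hypothetical and not part of the notebook.

```python
def parse_count(text):
    """Convert a displayed counter such as '', '17', '1,204', '1.2K' or '3M' to an int."""
    text = text.strip().replace(",", "")
    if not text:
        return 0  # assumption: an empty counter means zero
    multipliers = {"K": 1_000, "M": 1_000_000}  # assumed abbreviation suffixes
    suffix = text[-1].upper()
    if suffix in multipliers:
        return int(round(float(text[:-1]) * multipliers[suffix]))
    return int(text)


# Illustrative counter strings, not real scraped data
examples = ["", "17", "1,204", "1.2K", "3M"]
parsed = [parse_count(value) for value in examples]
```

Abbreviated values lose precision on the page itself, so the converted numbers are only approximations for large counts.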

## Preparing for Scraping
### Log in to Twitter and navigate to the page to scrape
____
Executing this block asks the user for a *username*, a *password* (via getpass) and the *searchterm* to scrape. It then logs in with the entered credentials and navigates to the page that lists all tweets related to the searchterm, sorted by Latest.

In [None]:
# get user inputs
my_username = input('Enter your Twitter Username: ')
my_password = getpass()
searchterm = input('The tweets of which searchterm do you want to scrape?: ')

# instantiate a driver object using the Chrome class from selenium's webdriver subpackage
PATH = "../webdrivers/chromedriver"
driver = Chrome(PATH)

driver.get('https://twitter.com/login/') # navigate to the twitter login page using the driver's 'get' method
sleep(2)

# enter the username and password
username = driver.find_element_by_xpath('//input[@name="session[username_or_email]"]')
username.send_keys(my_username)
password = driver.find_element_by_xpath('//input[@name="session[password]"]') # enter the password
password.send_keys(my_password)
password.send_keys(Keys.RETURN) # submit the login form by pressing enter (same effect as clicking the login button)
sleep(2)

# enter the searchterm 
search_input = driver.find_element_by_xpath('//input[@aria-label="Search query"]')
search_input.send_keys(searchterm)
search_input.send_keys(Keys.RETURN)
sleep(2)

driver.find_element_by_link_text('Latest').click()
sleep(1)
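
As an alternative to typing into the search box and clicking 'Latest', the results page could be opened directly through its URL. The following is only a sketch and rests on assumptions about Twitter's web interface that are not confirmed by this notebook: that the search page accepts the query in a `q` parameter, that `f=live` selects the Latest tab, and that the `since:YYYY-MM-DD` search operator restricts results by date. The helper `build_search_url` is hypothetical.

```python
from urllib.parse import urlencode


def build_search_url(searchterm, since=None, latest=True):
    """Build a search URL for a term, optionally restricted to tweets since a date."""
    # Assumption: 'since:YYYY-MM-DD' is honoured as a search operator.
    query = searchterm if since is None else f"{searchterm} since:{since}"
    params = {"q": query}
    if latest:
        params["f"] = "live"  # assumption: selects the 'Latest' tab
    return "https://twitter.com/search?" + urlencode(params)


# Illustrative usage; with a logged-in driver this could be passed to driver.get(...)
url = build_search_url("corona", since="2020-03-13")
```

`urlencode` takes care of escaping characters such as `#`, `:` and spaces, so hashtags and multi-word searchterms can be passed in unchanged.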

## Scraping the Data
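### Run a while loop until the end of the page and store all tweet data in a list
---
When this block is executed, the scraper continuously scrapes all tweets on the results page for the entered searchterm. It scrolls down after each pass and stops once the scroll position no longer changes after three attempts.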

In [None]:
# get all tweets on the page

data = [] # this is the list that will contain the data for each tweet as a tuple
tweet_ids = set() # we use this to make sure that our dataset consists of unique tweets
last_position = driver.execute_script("return window.pageYOffset;")
scrolling = True

while scrolling:
    page_cards = driver.find_elements_by_xpath('//div[@data-testid="tweet"]')
    for card in page_cards[-15:]:
        tweet = get_tweet_data(card)
        if tweet:
            tweet_id = ''.join(tweet)
            if tweet_id not in tweet_ids:
                tweet_ids.add(tweet_id)
                data.append(tweet)
            
    scroll_attempt = 0
    while True:
        # scroll to the bottom, then check whether the scroll position changed
        driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
        sleep(3)
        curr_position = driver.execute_script("return window.pageYOffset;")
        if last_position == curr_position:
            scroll_attempt += 1
            
            # end of scroll region
            if scroll_attempt >= 3:
                scrolling = False
                break
            else:
                sleep(3) # attempt another scroll
        else:
            last_position = curr_position
            break

# close the web driver
driver.close()
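
Because each pass re-reads the last 15 cards on the page, consecutive passes overlap and the same tweet is usually seen more than once. The de-duplication used in the loop can be illustrated in isolation, without a browser. This is a minimal sketch: the helper `add_unique` is hypothetical, and the tuples are made-up examples in the order returned by `get_tweet_data`.

```python
def add_unique(tweets, data, seen_keys):
    """Append tweets that were not seen before and return how many were added."""
    added = 0
    for tweet in tweets:
        if not tweet:  # get_tweet_data returns None for promoted tweets
            continue
        key = ''.join(tweet)  # all fields joined into one string serve as the id
        if key not in seen_keys:
            seen_keys.add(key)
            data.append(tweet)
            added += 1
    return added


# Two overlapping passes over illustrative tweet tuples
ann = ("Ann", "@ann", "2020-03-13T10:00:00.000Z", "hello", "1", "2", "3")
bob = ("Bob", "@bob", "2020-03-13T11:00:00.000Z", "hi", "", "", "")
demo_data, demo_seen = [], set()
first = add_unique([ann, None], demo_data, demo_seen)
second = add_unique([ann, bob], demo_data, demo_seen)
```

One consequence of building the key from all fields is that a tweet whose like or reply count changes between two passes counts as a new tweet. A key built only from handle, timestamp and text would avoid that.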

## Saving
### Store the data scraped in the cell above in a csv-file named after the searchterm
---
Executing this cell creates a csv-file named after the searchterm that was entered before and saves it in the directory that contains the Jupyter notebook.

In [None]:
with open(f"scraped_'{searchterm}'.csv", 'w', newline='', encoding='utf-8') as outfile:
    # column order must match the tuple returned by get_tweet_data (replies, retweets, likes)
    header = ['Username', 'Handle', 'Timestamp', 'Text', 'Comments', 'Retweets', 'Likes']
    writer = csv.writer(outfile)
    writer.writerow(header)
    writer.writerows(data) 
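
To check that the columns line up with the scraped tuples, the output can be read back with `csv.DictReader`. The following is a minimal sketch that writes to an in-memory buffer instead of a file. It assumes the header lists the columns in the same order as the tuple returned by `get_tweet_data` (replies, retweets, likes). The names `FIELDS`, `rows_to_csv_text` and `read_back` are hypothetical, and the row is made-up example data.

```python
import csv
import io

# Assumed column order: same as the tuple returned by get_tweet_data
FIELDS = ['Username', 'Handle', 'Timestamp', 'Text', 'Comments', 'Retweets', 'Likes']


def rows_to_csv_text(rows):
    """Write the header and rows to an in-memory buffer and return the CSV text."""
    buffer = io.StringIO(newline='')
    writer = csv.writer(buffer)
    writer.writerow(FIELDS)
    writer.writerows(rows)
    return buffer.getvalue()


def read_back(csv_text):
    """Parse CSV text into a list of dictionaries keyed by the header names."""
    return list(csv.DictReader(io.StringIO(csv_text, newline='')))


# Tweet text may contain commas and line breaks; the csv module quotes such fields
sample = [("Ann", "@ann", "2020-03-13T10:00:00.000Z", "stay home,\nstay safe", "4", "10", "25")]
records = read_back(rows_to_csv_text(sample))
```

Opening the real file with `newline=''`, as the saving cell does, matters for the same reason: it keeps line breaks inside tweet text intact.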