# Twitter scraper: COVID-19 vaccine sentiment around press conferences in the Netherlands

We built a Twitter scraper to find out the sentiment towards the COVID-19 vaccine in the Netherlands in the time frames surrounding the Dutch government's press conferences. With this data we want to measure how the regulations regarding COVID-19 influence the attitude people have towards COVID-19 vaccines.

Our scraper produces a CSV file containing the following:
- Dutch tweets containing any of the following keywords: *coronavaccin, corona_vaccin, covidvaccin, covid_vaccin, corona_vaccine, coronavaccine, covidvaccine, covid_vaccine*
- The dates on which the tweets were posted (around the press conferences: 3 days before and 3 days after; *dates and time frame still to be determined, see https://www.rijksoverheid.nl/onderwerpen/coronavirus-covid-19/coronavirus-beeld-en-video/videos-persconferenties*)
- Content of the tweets
- User ID *anonymized by means of ...*
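
The window of 3 days before and 3 days after a press conference can be computed from the press conference date. The following is a minimal sketch and not part of our scraper; the helper name `search_window` and the example date are assumptions made for illustration only.

```python
from datetime import date, timedelta

def search_window(press_conference, days=3):
    """Return (since, until) as ISO date strings around a press conference date."""
    delta = timedelta(days=days)
    since = press_conference - delta
    until = press_conference + delta
    return since.isoformat(), until.isoformat()

# Hypothetical example date, used only to illustrate the computation
since, until = search_window(date(2020, 3, 31))
print(since, until)
```

The two strings can then be filled into the `since:` and `until:` operators of the search query.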

## TODO: the date still needs to be changed!

For our code, we decided to use Selenium because Twitter is a dynamic web page: tweets are only loaded while scrolling, so we need to mimic the scrolling of a Twitter user.
Therefore, we first have to install a browser driver and import some prerequisites.

In [108]:
import csv
import selenium.webdriver   
from getpass import getpass
from time import sleep
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import NoSuchElementException

Then, we wrote a function that extracts the data of a single tweet, so that we can call it for every tweet we encounter. We used Selenium's `find_element_by_xpath` method to find specific elements in the source code of the website. The screenshot below shows an example of how we found these elements (for the username and the comment). The `.text` property returns the visible text belonging to the element we were looking for.

<img src="https://raw.githubusercontent.com/jobantonis/Tremendously-awesome-repository/main/Screenshot%202021-03-11%20at%2017.17.12.png" align="center" width=60%/>

In [109]:
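def get_tweet_data(card):
    """Extract the data of one tweet card; return None if the card has no handle or timestamp."""
    # Note: the find_element_by_xpath helpers were removed in Selenium 4.3;
    # newer versions use card.find_element(By.XPATH, ...) instead.
    username = card.find_element_by_xpath('.//span').text  # first span in the card: display name
    try:
        handle = card.find_element_by_xpath('.//span[contains(text(), "@")]').text
    except NoSuchElementException:
        return None  # no handle found, so skip this card

    try:
        postdate = card.find_element_by_xpath('.//time').get_attribute('datetime')
    except NoSuchElementException:
        return None  # no timestamp found, so skip this card

    comment = card.find_element_by_xpath('.//div[2]/div[2]/div[1]').text  # text of the tweet itself
    responding = card.find_element_by_xpath('.//div[2]/div[2]/div[2]').text  # attached content, e.g. a link preview
    reply_cnt = card.find_element_by_xpath('.//div[@data-testid="reply"]').text
    retweet_cnt = card.find_element_by_xpath('.//div[@data-testid="retweet"]').text
    like_cnt = card.find_element_by_xpath('.//div[@data-testid="like"]').text

    # All values are stored as strings, exactly as shown on the page
    tweet = {'username': username,
             'handle': handle,
             'date': postdate,
             'comment': comment,
             'responding': responding,
             'reply_cnt': reply_cnt,
             'retweet_cnt': retweet_cnt,
             'like_cnt': like_cnt}
    return tweet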
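
To illustrate how relative XPath expressions such as `.//span` or `.//div[@data-testid="reply"]` select elements inside one card, the sketch below applies the same kind of expressions to a static fragment using Python's built-in `xml.etree.ElementTree` instead of Selenium. The fragment is a hypothetical, heavily simplified stand-in for a tweet card and not Twitter's real markup. One difference to keep in mind: ElementTree returns `None` for a missing element, whereas Selenium raises `NoSuchElementException`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for one tweet card
card_html = """
<div data-testid="tweet">
  <span>Example User</span>
  <span>@example_user</span>
  <time datetime="2020-03-30T12:00:00.000Z">30 mrt.</time>
  <div data-testid="reply">6</div>
</div>
"""

card = ET.fromstring(card_html)

username = card.find('.//span').text                        # first <span> below the card
postdate = card.find('.//time').get('datetime')             # read an attribute instead of the text
reply_cnt = card.find('.//div[@data-testid="reply"]').text  # select a div by its attribute value
missing = card.find('.//div[@data-testid="like"]')          # not present in the fragment

print(username, postdate, reply_cnt, missing)
```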

Because we use Selenium, we have to choose a browser in which to open Twitter. We decided to use Chrome, so we create a Chrome driver.

In [110]:
driver = selenium.webdriver.Chrome()

We do not want to log into Twitter. Therefore, we chose to use a search URL that already contains the keywords, the language filter and the specific dates we want to search for. Next, we load this URL in the Chrome driver.

In [111]:
driver.get('https://twitter.com/search?q=(coronavaccin%20OR%20corona_vaccin%20OR%20covidvaccin%20OR%20covid_vaccin%20OR%20corona_vaccine%20OR%20coronavaccine%20OR%20covidvaccine%20OR%20covid_vaccine)%20lang%3Anl%20until%3A2020-04-02%20since%3A2020-03-28&src=typed_query&f=live') 
#driver.maximize_window() 
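
Because the search URL has to be adjusted for every press conference, it can be convenient to assemble it from the keywords and the dates instead of editing it by hand. The following is a minimal sketch; the function name `build_search_url` and its parameters are our own assumptions, and it only reproduces the URL format used above.

```python
from urllib.parse import quote

def build_search_url(keywords, since, until, lang="nl"):
    """Assemble a Twitter live-search URL matching any of the given keywords."""
    query = "(" + " OR ".join(keywords) + f") lang:{lang} until:{until} since:{since}"
    # Keep the parentheses as they are; spaces become %20 and colons become %3A
    return ("https://twitter.com/search?q=" + quote(query, safe="()")
            + "&src=typed_query&f=live")

keywords = ["coronavaccin", "corona_vaccin", "covidvaccin", "covid_vaccin",
            "corona_vaccine", "coronavaccine", "covidvaccine", "covid_vaccine"]
url = build_search_url(keywords, since="2020-03-28", until="2020-04-02")
print(url)
```

The resulting string can be passed to `driver.get(url)`.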

Next, we prepare the containers for the scraped tweets. We create a list called `data` to store them. Furthermore, since Twitter is a dynamic website that loads tweets while scrolling, the same tweet can show up more than once; to avoid adding tweets twice, we create a `set()` called `tweet_ids`. Lastly, we record the current scroll position, which we later use to detect when the end of the page has been reached.

In [112]:
data = [] # list that will be filled with the scraped tweets
tweet_ids = set() # used to avoid storing duplicate tweets, since the same tweet can be loaded in more than one scroll
last_position = driver.execute_script("return window.pageYOffset;") # current scroll position, used to detect the end of the page
scrolling = True
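
To show how a set can prevent duplicates, here is a minimal sketch that is independent of the browser. The helper name `add_unique` and the example tweet are assumptions for illustration. Because a dictionary cannot be stored in a set, the sketch uses a tuple of the tweet's values as its key.

```python
def add_unique(tweet, data, seen):
    """Append tweet to data unless an identical tweet was stored before."""
    key = tuple(tweet.values())  # a dict cannot be put in a set, a tuple of its values can
    if key in seen:
        return False
    seen.add(key)
    data.append(tweet)
    return True

data_demo, seen_demo = [], set()
first = {'handle': '@example_user', 'date': '2020-03-30T12:00:00.000Z', 'comment': 'hello'}
second = dict(first)  # the same tweet, encountered again after scrolling

add_unique(first, data_demo, seen_demo)
add_unique(second, data_demo, seen_demo)
print(len(data_demo))
```

Looking up a key in a set stays fast as the number of tweets grows, whereas checking `tweet not in data` compares against every stored tweet.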

While scrolling, we capture the data of the tweets we encounter, and then we scroll further down the page. Since scrolling may fail to register when the internet connection lags, the scraper does not stop at the first failed scroll: only after three unsuccessful scroll attempts do we conclude that the end of the page has been reached.

In [113]:
while scrolling:
    
    page_cards = driver.find_elements_by_xpath('//div[@data-testid="tweet"]') # every tweet is rendered inside a div with data-testid="tweet"
    for card in page_cards[-15:]: # only the last 15 cards, since we assume about 15 tweets are loaded per scroll
        tweet = get_tweet_data(card)
        if tweet: # get_tweet_data returns None for cards without a handle or timestamp
            tweet_id = tuple(tweet.values()) # all fields together identify a tweet (a dict itself cannot be stored in a set)
            if tweet_id not in tweet_ids: # only keep tweets that we have not captured before
                tweet_ids.add(tweet_id)
                data.append(tweet)

    scroll_attempt = 0 # scrolling sometimes does not register due to lag, so we allow a number of attempts
    while True:
        driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
        sleep(1) # give the page time to load before scraping
        curr_position = driver.execute_script("return window.pageYOffset;")
        if last_position == curr_position: # the page did not move: either lag or the end of the page
            scroll_attempt += 1
            # after 3 failed attempts we assume the end of the page is reached and stop scraping
            if scroll_attempt >= 3:
                scrolling = False
                break
            else:
                sleep(2) # wait a bit longer before trying again
        else:
            last_position = curr_position # the page moved: remember the new position and scrape the new tweets
            break
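
The retry logic of the scroll loop can be tried out without a browser. The sketch below isolates it in a function and runs it against a simulated page; `scroll_until_stable` and `FakePage` are hypothetical names, and unlike our scraper the sketch does not collect tweets between scrolls.

```python
def scroll_until_stable(scroll, get_position, max_attempts=3, pause=lambda seconds: None):
    """Scroll until the position stays the same for max_attempts tries in a row.

    Returns the number of scrolls that actually moved the page.
    """
    last_position = get_position()
    successful_scrolls = 0
    attempts = 0
    while attempts < max_attempts:
        scroll()
        pause(1)
        current = get_position()
        if current == last_position:
            attempts += 1            # possibly lag, possibly the end of the page
        else:
            attempts = 0             # the page moved, so reset the retry counter
            last_position = current
            successful_scrolls += 1
    return successful_scrolls


class FakePage:
    """Hypothetical stand-in for the browser: a page with a fixed height."""
    def __init__(self, height, step):
        self.height, self.step, self.position = height, step, 0

    def scroll(self):
        self.position = min(self.position + self.step, self.height)


page = FakePage(height=250, step=100)
moved = scroll_until_stable(page.scroll, lambda: page.position)
print(moved, page.position)
```

With a real driver, `scroll` and `get_position` would wrap the two `driver.execute_script` calls, and `time.sleep` could be passed as `pause`.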

We first inspect the collected data and check the number of tweets. We then copy the data to a CSV file, writing a header row followed by the tweets, using the following code:

In [55]:
data

[{'username': 'Koos Dirkse',
  'handle': '@koosdirkse',
  'date': '2020-03-15T23:49:15.000Z',
  'comment': "De man heeft werkelijk geen idee waar hij mee bezig is!\n‘Trump wil Duits bedrijf kopen voor exclusiviteit coronavaccin in VS' https://nu.nl/coronavirus/6037613/trump-wil-duits-bedrijf-kopen-voor-exclusiviteit-coronavaccin-in-vs.html… via \n@NUnl",
  'responding': "'Trump wil Duits bedrijf kopen voor exclusiviteit coronavaccin in VS'\nDe Amerikaanse regering wil een Duits bedrijf kopen dat bezig is met de ontwikkeling van een vaccin tegen COVID-19, schrijft de Duitse krant Die Welt zondag op basis van bronnen dicht bij de Duitse...\nnu.nl",
  'reply_cnt': '6',
  'retweet_cnt': '5',
  'like_cnt': '6'},
 {'username': 'Victor Robert Poppenburg',
  'handle': '@secondhomemondi',
  'date': '2020-03-15T23:36:12.000Z',
  'comment': '‘Trump probeert coronavaccin te kapen van Duits biotechbedrijf’. https://fd.nl/ondernemen/1338050/trump-probeert-coronamedicijn-te-kapen-van-duits-biotechbed

In [114]:
len(data)

132

In [115]:
with open("coronavaccin_tweetsuntil%3A2020-04-02%20since%3A2020-03-28.csv", "w", newline="", encoding='utf-8') as csv_file:
    cols = ['username',
            'handle',
            'date',
            'comment',
            'responding',
            'reply_cnt',
            'retweet_cnt',
            'like_cnt']
    writer = csv.DictWriter(csv_file, fieldnames=cols, restval='MISSING')
    writer.writeheader()
    writer.writerows(data)
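
The `restval='MISSING'` argument determines what is written when a tweet dictionary lacks one of the columns. The following minimal sketch shows this with made-up rows and an in-memory buffer instead of a file; the example rows and the reduced column list are assumptions for illustration.

```python
import csv
import io

cols = ['username', 'handle', 'like_cnt']
rows = [
    {'username': 'Example User', 'handle': '@example_user', 'like_cnt': '6'},
    {'username': 'Other User', 'handle': '@other_user'},  # no like_cnt key
]

buffer = io.StringIO(newline='')  # in-memory stand-in for the CSV file
writer = csv.DictWriter(buffer, fieldnames=cols, restval='MISSING')
writer.writeheader()
writer.writerows(rows)

lines = buffer.getvalue().splitlines()
print(lines)
```

The second row ends in `MISSING` because its dictionary has no `like_cnt` key.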


