Rutte's influence on taking a shot

What effect do the Dutch government's press conferences have on the attitude of its citizens towards the COVID vaccine?

Motivation

“Social media are implicated in many of contemporary society's most pressing issues, from influencing public opinion, to organizing social movements, to identifying economic trends.” (Hemphill, Hedstrom, & Leonard, 2021, p. 1)

Since March 2020, the Netherlands has been dealing with the coronavirus crisis. During this crisis the Dutch prime minister, Mark Rutte, and the deputy prime minister, Hugo de Jonge, communicated by means of press conferences to inform the public of the (updated) regulations. As the crisis took the world by storm, little information was available on the characteristics of the virus and its implications. This caused a number of problems for the Dutch government.

Firstly, regulations were based on data provided by the RIVM (the Dutch National Institute for Public Health and the Environment), which was not always up to date. There have been instances where data lagged behind while regulations (e.g. closing businesses and implementing a curfew) were based on it. Secondly, a so-called 'arms race' set in among pharmaceutical companies to develop a vaccine as soon as possible. Due to the quick development of the COVID vaccines, clinical trials were kept to a bare minimum. These reasons fostered skepticism and distrust among the Dutch public towards the government, its regulations and the vaccines developed by big pharma. In the Netherlands, opinions about corona, the regulations and the vaccine are clearly divided: part of the population is protesting and thinks corona is one big hoax, while another part 'believes' the coronavirus to be a true pandemic (Erdbrink, 2021). These different opinions lead to different sentiments towards vaccines as well. The social network Twitter is a medium where users can voice their opinions via tweets.

This research sets out to analyse the sentiment of the Dutch public towards a COVID vaccine. It is important to know where the general public stands in order to measure the effectiveness of government communication, and to develop a fitting strategy for communicating the distribution and administering of a vaccine. In this research, tweets are scraped based on a set timeframe surrounding each press conference and on predetermined COVID-related keywords. Tweets are preprocessed before computing sentiment scores: hash symbols (#), mention symbols (@), URLs, extra spaces, and paragraph breaks are removed, while punctuation and numbers are retained. Advanced preprocessing, such as (i) correcting misspelled words and (ii) expanding abbreviations to their original forms, is skipped to avoid an analysis bottleneck.
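The preprocessing rules described above can be sketched as a few regular-expression substitutions. This is a minimal illustration, not the repository's actual code; the function name and exact patterns are assumptions.

```python
import re

def preprocess_tweet(text: str) -> str:
    """Strip URLs and the # / @ symbols, collapse whitespace and
    paragraph breaks, while keeping punctuation and numbers."""
    text = re.sub(r"https?://\S+", "", text)  # remove URLs
    text = re.sub(r"[#@]", "", text)          # drop hash/mention symbols, keep the word
    text = re.sub(r"\s+", " ", text)          # collapse extra spaces and line breaks
    return text.strip()

print(preprocess_tweet("Check #vaccin news @rivm!\n\nhttps://t.co/abc123  Today: 1,900 doses."))
# Check vaccin news rivm! Today: 1,900 doses.
```

Note that only the symbols are removed, so hashtag and mention words remain available for keyword matching.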

Method and results

Our method consists of five steps to analyze the sentiment data.

Step 1 - Data collection

The first step consists of data collection. In this research, data is collected from Twitter using a bespoke scraper, which is run over different timeframes and targeted at tweets containing COVID-related keywords. The data is subsequently saved as a CSV file.
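One way to express "a set timeframe surrounding each press conference" is to build Twitter search queries with the standard `since:`/`until:` date operators. The helper below is a hypothetical sketch; the keywords, window sizes, and function name are illustrative, and the repository's scraper may construct its queries differently.

```python
from datetime import date, timedelta

def build_queries(keywords, conference_date, days_before=3, days_after=7):
    """Build one search query per keyword for a window around a press conference."""
    since = conference_date - timedelta(days=days_before)
    until = conference_date + timedelta(days=days_after)
    return [f"{kw} since:{since.isoformat()} until:{until.isoformat()}"
            for kw in keywords]

queries = build_queries(["corona vaccin", "covid vaccine"], date(2020, 12, 15))
print(queries[0])  # corona vaccin since:2020-12-12 until:2020-12-22
```

Each query string can then be fed to the scraper, one timeframe per press conference.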

Step 2 - Data Preparation

The second step consists of data preparation and cleaning. A script is provided that cleans the CSV files produced by the scraper. After the cleaner has been run, the data is ready for analysis. The script cleans and labels the data by:

  • Removing NA values and converting non-numerical depictions of numbers (e.g. 1.9k to 1900)
  • Removing all excess white space and non-valuable text elements
  • Removing extended ASCII, commercial/trade symbols, and mathematical symbols
  • Removing duplicate tweets (literally identical content)
  • Adjusting timestamps
  • Encoding usernames as numerical values

In the end, the data cleaning and normalization process leaves us with 67.6% of the original tweets.
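The actual cleaning is done in clean.R; as an illustration, two of the listed steps (converting shorthand counts and dropping literal duplicates) could look like this in Python. Function names and edge-case handling are assumptions, not the repository's code.

```python
import re

def clean_count(raw):
    """Convert shorthand counts like '1.9k' to integers; None for missing values."""
    if raw is None or raw.strip() == "":
        return None
    m = re.fullmatch(r"([\d.]+)\s*[kK]", raw.strip())
    if m:
        return int(round(float(m.group(1)) * 1000))
    return int(raw.replace(",", ""))

def dedupe_tweets(tweets):
    """Drop tweets with literally identical text, keeping the first occurrence."""
    seen, unique = set(), []
    for t in tweets:
        if t not in seen:
            seen.add(t)
            unique.append(t)
    return unique

print(clean_count("1.9k"))                                # 1900
print(dedupe_tweets(["same tweet", "same tweet", "b"]))   # ['same tweet', 'b']
```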

Step 3 - Sentiment detection

In the third step, each tweet's text field is examined for subjectivity. Tweets with subjective expressions are retained, while tweets containing only objective expressions are discarded.
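In essence, this step filters tweets on a subjectivity signal. The toy filter below illustrates the idea with a tiny hand-made word list; the actual analysis relies on a sentiment lexicon, so both the word list and the function are purely illustrative.

```python
# Tiny illustrative list of opinion-bearing words (an assumption, not the
# lexicon used in the repository).
SUBJECTIVE_WORDS = {"good", "bad", "great", "terrible", "hope", "fear", "love", "hate"}

def is_subjective(tweet: str) -> bool:
    """Retain a tweet if it contains at least one subjective word."""
    return bool(set(tweet.lower().split()) & SUBJECTIVE_WORDS)

tweets = ["I hope the vaccine works", "Press conference starts at 19:00"]
subjective = [t for t in tweets if is_subjective(t)]
print(subjective)  # ['I hope the vaccine works']
```

A lexicon-based tool would score subjectivity on a continuous scale rather than with a binary word match, but the filtering logic is the same.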

Step 4 - Sentiment classification

The fourth step classifies each subjective string into one of three groups: positive, negative or neutral, based upon its polarity score. An existing sentiment lexicon is used in conjunction with the textual context of the tweets.
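Classification by polarity score amounts to summing word-level scores and thresholding. The sketch below uses a hand-made polarity table for illustration; the repository relies on an existing lexicon, so the words, scores, and threshold here are assumptions.

```python
# Illustrative word-level polarity scores (not the lexicon used in the study).
POLARITY = {"good": 1, "great": 2, "hope": 1, "bad": -1, "terrible": -2, "fear": -1}

def classify(tweet: str, threshold: float = 0.0) -> str:
    """Sum word polarities and map the total to positive/negative/neutral."""
    score = sum(POLARITY.get(w, 0) for w in tweet.lower().split())
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"

print(classify("great news, I hope it works"))  # positive
print(classify("press conference today"))       # neutral
```

A nonzero threshold widens the neutral band, which is one way to keep weakly polarized tweets out of the positive/negative groups.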

Step 5 - Presentation of output

After sentiment classification, the goal of the fifth step is to structure the data in a visual and informative manner. The results are displayed in an array of different graphs.
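The plots of sentiment over time boil down to counting classified tweets per day and per label. A sketch, assuming hypothetical (date, sentiment) records as input; the real input is the output of the classification step, and the plot styling is illustrative.

```python
from collections import Counter

# Hypothetical classified records (date, sentiment); real input comes from step 4.
records = [("2020-11-19", "positive"), ("2020-11-19", "positive"),
           ("2020-11-19", "negative"), ("2020-12-15", "neutral")]

counts = Counter(records)  # (date, sentiment) -> number of tweets
dates = sorted({d for d, _ in records})
series = {label: [counts[(d, label)] for d in dates]
          for label in ("positive", "negative", "neutral")}
print(series["positive"])  # [2, 0]

try:
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt
    for label, ys in series.items():
        plt.plot(dates, ys, label=label)
    plt.legend()
    plt.xlabel("date")
    plt.ylabel("tweets")
    plt.savefig("sentiment_over_time.png")
except ImportError:
    pass  # plotting is optional for this sketch
```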

The outcome of the sentiment analysis is divided into three plots that indicate the overall fluctuations in positive, negative and neutral tweet sentiments around the press conferences. The number of tweets increased drastically around 19 November 2020, when the Dutch government announced that it was going to buy corona vaccines for Dutch citizens. In hindsight, this was all leading up to the press conference of 15 December 2020, at which a harsh lockdown in the Netherlands would eventually be announced. Our experiments on Twitter sentiment analysis show that there were more positive tweets about the corona vaccine in total than negative or neutral ones. Like any other method, ours also faces the constraint that opinions voiced on social media may differ from real opinions. (Image: number of tweets per keyword)

We find that the sentiment of the tweets correlates negatively with a press conference, and we observe a shift in sentiment towards the corona vaccine from either negative or positive to neutral as the pandemic progressed. The results also show that a large portion of the records, approximately 45 to 60 percent, were objective. From this study we can say that people's reactions on social media, specifically Twitter, vary from day to day. Unfortunately, the pandemic is still far from over and the dynamics of this analysis may very well change in the future. (Image: vaccine sentiment analysis)

Repository overview

Overview of the directory structure and files:

├── README.md
├── App
│      └── output
│             ├─── RUN RMD FILE TO RENDER APP.txt
│             └─── Polarity_over_time_app.Rmd
├── makefile
├── .gitignore
├── data
│      └── dataset1
│             └─── DataCollection_Twitter.csv
├── gen
│      ├── analysis
│      │      └─ output
│      │          └── sentiment_analysis_visualizations.ipynb
│      └─── data-preparation
│            └─ output
│                └── .gitattributes
├── prerequistites
│      └─── custom_urls.py
├── src
│      ├── collect
│      │      ├── client_secret_needed
│      │      ├── Collect.py
│      │      ├── Upload.py
│      │      └── .DS_Store
│      ├── preparation
│      │      └── clean.R
│      ├── analysis
│      │      └── sentiment_analysis.py
│      └── .DS_Store
└── images
       ├── Image_Number_Tweets_Keywords.PNG
       └── Image_Vaccine_Sentiment_Status.PNG

When entering the main branch of the GitHub repository, several files and folders can immediately be seen, among them the folders "data" and "src", a .gitignore, a README.md (which contains all text seen above and below) and a client_secrets.json file. Entering the "src" folder, one finds the subfolders "collect", "preparation" and "analysis". The "collect" folder contains, among others, Collect.py, which holds the webscraper used to scrape the Twitter data and writes the collected data to the "data" folder. The "preparation" folder contains clean.R, the cleaning script, which cleans up the data retrieved in data/dataset1/DataCollection_Twitter.csv. The "analysis" folder contains one file, sentiment_analysis.py, which analyzes the cleaned data for sentiment within the tweets and the levels of polarity.

Running instructions

| Collect | Save dataset | Clean | Analysis |
|---|---|---|---|
| Collect.py | Collect.py | clean.R | sentiment_analysis.py |
| Python & Selenium | Python | RStudio | Python |

To start off, the webscraper (Collect.py) needs to be run. For this to work, the ChromeDriver for Selenium needs to be installed; other than that, no further input is required and the webscraper should run if the ChromeDriver is installed correctly. The collected dataset is then uploaded to one's GitHub repository, from which it can be retrieved as data/dataset1/DataCollection_Twitter.csv. Next, one can run clean.R to execute the cleaning script, which works automatically (for info on which packages to install, see the header "Dependencies for cleaning data & sentiment analysis" below). A clean dataset is then produced, on which any analysis can be done. If sentiment analysis is done, make sure to import the correct file when starting the analysis. The sentiment analysis can be run using the sentiment_analysis.py file.
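The repository contains a makefile; the pipeline stages above could be wired together roughly as follows. This is a sketch only: the target names and the exact invocations are assumptions, not the repository's actual makefile.

```make
all: analysis

collect:
	python src/collect/Collect.py

clean-data: collect
	Rscript src/preparation/clean.R

analysis: clean-data
	python src/analysis/sentiment_analysis.py
```

Expressing the dependencies this way means `make` re-runs only the stages whose inputs changed.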

More resources

Academic background on the dataset creation: the dataset uses keywords indicated by the research of Ramírez-Sáyago (2020) to find relevant tweets about the COVID-19 virus. However, since the COVID vaccine is the main focus of this dataset, the keywords of Ramírez-Sáyago (2020) are combined with "vaccin(e)" appended. The keywords "coronavirus" and "corona" were not used in that research; Kruspe et al. (2020) did use these keywords in their sentiment analysis, so they were also added, combined with the keyword "vaccine", in the composition of the dataset.
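The keyword construction just described can be sketched in a few lines. The base keywords below are illustrative placeholders, not the exact lists from Ramírez-Sáyago (2020) or Kruspe et al. (2020).

```python
# Assumed example keywords standing in for the published keyword lists.
base_keywords = ["covid", "pandemic", "lockdown"]   # in the style of Ramírez-Sáyago (2020)
extra_keywords = ["coronavirus", "corona"]          # added following Kruspe et al. (2020)

# Append "vaccine" to every keyword to focus the dataset on the vaccine.
search_terms = [f"{kw} vaccine" for kw in base_keywords + extra_keywords]
print(search_terms)
# ['covid vaccine', 'pandemic vaccine', 'lockdown vaccine',
#  'coronavirus vaccine', 'corona vaccine']
```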

Part of the Dutch population is protesting and thinks corona is one big hoax; this sentiment analysis does not touch upon the relationship between whether or not Twitter users believe in this hoax and their vaccine sentiment. The latter is described in an article in the New York Times.

About

This study was conducted as part of the course Online Data Collections (oDCM) from Tilburg University. The data cleaned by the scraper will then be used for the course Data Prep and Workflow Management (dPrep), also from Tilburg University. All team members of team 1 were involved in the process of building, developing, optimizing, cleaning and finally documenting and analyzing the data. Team members: Anouk van Gestel, Job Antonis, Marc Lefebvre, Raul Kleinherenbrink and Lieke Adams.

Dependencies for scraper

Python packages: getpass, time, selenium (selenium.webdriver.common.keys, selenium.webdriver.chrome). Download the Selenium Chrome webdriver (ChromeDriver) and make sure to select the version matching your Chrome installation.

Dependencies for cleaning data & sentiment analysis

R packages: stringr, tidyverse, data.table (install via install.packages(c("stringr", "tidyverse", "data.table"))).

Python packages: matplotlib.pyplot, matplotlib.dates, pandas, numpy, datetime, csv.
