# Sentiment analysis of Dutch COVID-19 related Twitter messages

## Research question
*Is it possible to predict whether a COVID-19 related Twitter message is positive or negative?*

#### Background information about the project
* This project is part of the Ironhack Data Analytics course I followed between June and December 2020.
* Project duration: 14th - 28th November 2020.
* Overarching goals:
   1. Learn how to learn and apply new tools and methodologies. Keeping up with developments and teaching yourself new tools is an important trait of a data analyst;
   2. Focus on learning a skill transferable to the final project.
* Specific goals for Natural Language Processing projects:
    * Design and build an NLP prototype. 
    * Integrate a machine learning algorithm in an NLP framework (e.g. a chatbot or sentiment analyzer)  
    * Use evaluation metrics to ground your claims and test your model. 
    * Communicate the results of your analysis clearly, accurately, and engagingly. Demonstrate your working prototype. 
    * Learn to tailor the style of your communication to the audience.

#### Download the dataset
Please download the dataset from [here](https://drive.google.com/file/d/1JnvJrmTC0XgA5XwU9Tywxb8Cliqe8Gi8/view?usp=sharing) and place it in the data folder.

#### Context about the data
The dataset contains around 17,500 Dutch Twitter messages, obtained via the Twitter API using Tweepy. You can find the code in the Python file 'How to get data from Twitter' in this folder.

The Twitter messages have the following characteristics:
- Dutch language label.
- Created between 16 Nov 2020 and 24 Nov 2020.
- Retweets are filtered out of the dataset.
- Each message has at least one of the following hashtags: #corona, #CoronaNL, #coronanederland, #coronavirus, #coronavirusnl, #coronavirusNederland, #covid19, #COVID19NL, #covid19nederland, #covid-19, #covid_19, #covid_19nl and #Covid_19Nederland.

An initial round of data cleaning was already done while obtaining the messages. Only two columns are kept in this dataset: the creation date and the text of the message.

#### Import libraries
First, install and import the necessary libraries to conduct the analysis.

In [1]:
import pandas as pd
import numpy as np
import random
from textblob import TextBlob
from textblob_nl import PatternTagger, PatternAnalyzer

# Install libraries if needed
# pip install textblob
# pip install textblob-nl

#### Import datasets
Secondly, import the dataset.

In [2]:
# Forward slashes work on Windows, macOS and Linux
tweets = pd.read_csv('data/twitterdata.csv')

#### Inspecting & cleaning the data
1. We first remove the limit on the column width. The default is 50 characters, which means we cannot read the whole message.
2. We print the first 5 rows of the dataframe to see what the dataset looks like.
3. The datatype of both columns is 'object'. Since 'created_at' is a date, we convert this column to datetime.
4. At midnight on Thursday the 19th of November, part of the lockdown measures in the Netherlands was relaxed ("widened"). This meant, for instance, that people were allowed to have more visitors at home and to visit a museum. This might influence the sentiment of the messages. Therefore we add an extra column to the dataframe: the time difference between the widening of the measures and the creation of the message.

In [3]:
# Set column width to infinite number of characters
pd.set_option('max_colwidth', None)

In [4]:
# Inspect the first 5 rows
tweets.head()

Unnamed: 0,created_at,text
0,Tue Nov 24 19:29:03 +0000 2020,Het #omt discussieert over de kerstvakantie #Corona.\nGooi 4 wekenlang dicht.... heerlijk!
1,Tue Nov 24 19:29:01 +0000 2020,Gaat dit eigenlijk wel over onze gezondheid?\nDocumentaire over de #Corona crisis.\nAlle medicijnen zijn in het begin… https://t.co/S11tNQ2yRI
2,Tue Nov 24 19:28:40 +0000 2020,President @EmmanuelMacron: heropening winkels en gebedsplaatsen vanaf zaterdag 28NOV; musea-cinema’s vanaf 15DEC; c… https://t.co/uaODWssHJ4
3,Tue Nov 24 19:27:50 +0000 2020,Haha wat een waanzin. Tijdens de wedstrijd zitten ze op en over elkaar maar hadden schudden nadien verspreid covid-… https://t.co/AjCBjNHZLp
4,Tue Nov 24 19:27:34 +0000 2020,@Wiep13396680 @dekoran1 @LidwienNews @JohannesKingma Als de sekte @D66 verdwijnt is #Corona ook verdwenen.


In [5]:
# Inspect datatypes
tweets.dtypes

created_at    object
text          object
dtype: object

In [6]:
# Convert created_at column to datetime
tweets['created_at'] = pd.to_datetime(tweets['created_at'])
tweets.dtypes

created_at    datetime64[ns, UTC]
text                       object
dtype: object
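
The conversion above lets pandas infer the date format. As a minimal sketch, assuming every `created_at` value follows the pattern visible in the output of `tweets.head()`, the format could also be passed explicitly. The two example strings are copied from that output; the variable names `example` and `twitter_format` are only illustrative.

```python
import pandas as pd

# Two example values copied from the created_at column shown earlier
example = pd.Series([
    "Tue Nov 24 19:29:03 +0000 2020",
    "Tue Nov 24 19:29:01 +0000 2020",
])

# %a = abbreviated weekday, %b = abbreviated month, %z = UTC offset such as +0000
twitter_format = "%a %b %d %H:%M:%S %z %Y"
parsed = pd.to_datetime(example, format=twitter_format, utc=True)
print(parsed)
```

With an explicit format, a value that does not match the pattern raises an error, so unexpected rows surface early.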

#### Calculating in the dataframe

1. First create an extra column with the time difference since the widening of the lockdown measures. Interpretation of the value:
    - Negative time difference: the tweet was sent before the widening of the measures
    - Positive time difference: the tweet was sent after the widening of the measures
2. The oldest message is from the 16th of November, 18:44 (UTC).
3. The most recent message is from the 24th of November, 19:29 (UTC).

In [7]:
# First add a column with the date of the widening of the lockdown
tweets['lockdown_widening'] = '2020-11-19 00:00:01+00:00'

# Convert column to datetime
tweets['lockdown_widening'] = pd.to_datetime(tweets['lockdown_widening'])

# Check dtypes in dataframe
tweets.dtypes

# Create an extra column: time difference
tweets['timedelta'] = tweets['created_at'] - tweets['lockdown_widening']

# Inspect tail of dataset
tweets.tail()

Unnamed: 0,created_at,text,lockdown_widening,timedelta
17489,2020-11-16 18:45:32+00:00,Spieballen\nBen je Positief ga je een kamertje in en is iedere contact met een ander mens VERBODEN\nWordt Stinken 🤪… https://t.co/abJZdGcA1V,2020-11-19 00:00:01+00:00,-3 days +18:45:31
17490,2020-11-16 18:45:13+00:00,Onderzoek: covid-19 tast belangrijkste ademspier aan bij ernstig zieken - https://t.co/Yb2sqI6Fnr,2020-11-19 00:00:01+00:00,-3 days +18:45:12
17491,2020-11-16 18:44:56+00:00,"@jeacath @freek_niesten Wat een kromme redenering. Iedere dag je tanden poetsen is ook niet te verwaarlozen, maar h… https://t.co/6JCcAA7crF",2020-11-19 00:00:01+00:00,-3 days +18:44:55
17492,2020-11-16 18:44:50+00:00,Wanneer wordt #Corona overwonnen in ons deel van de wereld? Ongeveer in maart 2021. #vaccin \n\nWanneer zijn de Tweed… https://t.co/bFkvHDcVyn,2020-11-19 00:00:01+00:00,-3 days +18:44:49
17493,2020-11-16 18:44:16+00:00,Sorry maar #thuiswerkers hebben het makkelijk genoeg. Ga eens in de zorg kijken #corona #luxeprobleem @RTLnieuws,2020-11-19 00:00:01+00:00,-3 days +18:44:15
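
pandas displays a negative time difference as a negative number of days plus a positive remainder, so `-3 days +18:45:31` means the tweet was sent a little over 2 days and 5 hours before the reference time. The sketch below shows one way the sign could be turned into columns that are easier to read. The column names `hours_since_widening` and `after_widening` are assumptions for illustration and are not part of the original notebook.

```python
import pandas as pd

# Reference time used in the notebook for the widening of the measures
widening = pd.Timestamp("2020-11-19 00:00:01+00:00")

# Two creation times taken from the outputs above
demo = pd.DataFrame({
    "created_at": pd.to_datetime(
        ["2020-11-16 18:45:32+00:00", "2020-11-24 19:29:03+00:00"]
    )
})

demo["timedelta"] = demo["created_at"] - widening
# Signed number of hours: negative before the widening, positive after
demo["hours_since_widening"] = demo["timedelta"].dt.total_seconds() / 3600
# Boolean flag that is True for tweets sent after the widening
demo["after_widening"] = demo["timedelta"] > pd.Timedelta(0)
print(demo)
```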


In [8]:
# Oldest message
tweets['created_at'].min()

Timestamp('2020-11-16 18:44:16+0000', tz='UTC')

In [9]:
# Most recent message
tweets['created_at'].max()

Timestamp('2020-11-24 19:29:03+0000', tz='UTC')

## Obtaining labeled Twitter messages
To create an ML algorithm that predicts the sentiment of a message (negative or positive), a labeled dataset is needed, i.e. a dataset with two columns:
1. The text of a message
2. The sentiment of the message (positive or negative).

Since I couldn't find a pre-labeled dataset, I will create one myself.

#### Creating a sample and population dataset
*Disclaimer: Because the list of random numbers changes each time this kernel runs, the sample dataset differs between runs. The code that saves it to a csv file is therefore preceded by a #, so that it does not overwrite the csv files.*

1. Create a sample dataset using random numbers. Save it as a dataframe.
2. Save the remaining data in a separate dataframe.

In [10]:
# Number of rows in the dataset (population)
population_size = len(tweets)
population_size

17494

In [11]:
# Create a list with 500 random integers
randomlist = random.sample(range(0, population_size), 500)
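
Because `random.sample` draws different integers on every run, the sample cannot be reproduced. A minimal sketch of a reproducible alternative, assuming a fixed seed is acceptable for this project; the seed value 42 and the name `rng` are arbitrary choices for illustration.

```python
import random

population_size = 17494  # number of rows reported above

# A dedicated, seeded generator returns the same 500 integers on every run
rng = random.Random(42)  # 42 is an arbitrary seed chosen for illustration
randomlist = rng.sample(range(population_size), 500)

# Sampling is done without replacement, so all values are distinct
print(len(randomlist), len(set(randomlist)))
```

With a fixed seed, the saved sample file and the notebook output stay in sync between runs.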

In [12]:
# Select rows corresponding to the 500 randomly chosen integers
tweets_sample = tweets.iloc[randomlist]
tweets_sample.head()

Unnamed: 0,created_at,text,lockdown_widening,timedelta
5466,2020-11-21 17:48:26+00:00,@line_forward @Zophia79307590 Zou best eens kunnen.\nDe covid-19 testkits zijn in 2018 al gedistribueerd. https://t.co/wwUZRW3g9k,2020-11-19 00:00:01+00:00,2 days 17:48:25
285,2020-11-24 17:03:44+00:00,Met of zonder Coca-Cola? #AkkoordGezondeVoedingOpScholen #CocaCola #Leefstijl #mentalegezondheidsproblemen… https://t.co/q45Zd3Uzdz,2020-11-19 00:00:01+00:00,5 days 17:03:43
12879,2020-11-18 14:39:29+00:00,Vanaf 1 december 2020 geldt een mondkapjesplicht in de publieke binnenruimtes. https://t.co/uCLpyJwPS3 #coronavirus… https://t.co/Nh2cFr6k7f,2020-11-19 00:00:01+00:00,-1 days +14:39:28
13127,2020-11-18 13:09:48+00:00,#vlog 19 #60plusser nadeel online netwerken + voordeel #tijd #Covid_19 https://t.co/c1BY78KThV,2020-11-19 00:00:01+00:00,-1 days +13:09:47
3432,2020-11-23 07:06:44+00:00,"Besmettingen op VO-scholen neemt toe. Niet heel vreemd, maatregelen worden minder nageleefd, want de ventilatie is… https://t.co/Q3YcnYaWWL",2020-11-19 00:00:01+00:00,4 days 07:06:43


In [13]:
# Select rows not corresponding to the 500 randomly chosen integers
tweets_remaining = tweets.loc[~tweets.index.isin(randomlist)]
tweets_remaining.head()

Unnamed: 0,created_at,text,lockdown_widening,timedelta
0,2020-11-24 19:29:03+00:00,Het #omt discussieert over de kerstvakantie #Corona.\nGooi 4 wekenlang dicht.... heerlijk!,2020-11-19 00:00:01+00:00,5 days 19:29:02
1,2020-11-24 19:29:01+00:00,Gaat dit eigenlijk wel over onze gezondheid?\nDocumentaire over de #Corona crisis.\nAlle medicijnen zijn in het begin… https://t.co/S11tNQ2yRI,2020-11-19 00:00:01+00:00,5 days 19:29:00
2,2020-11-24 19:28:40+00:00,President @EmmanuelMacron: heropening winkels en gebedsplaatsen vanaf zaterdag 28NOV; musea-cinema’s vanaf 15DEC; c… https://t.co/uaODWssHJ4,2020-11-19 00:00:01+00:00,5 days 19:28:39
3,2020-11-24 19:27:50+00:00,Haha wat een waanzin. Tijdens de wedstrijd zitten ze op en over elkaar maar hadden schudden nadien verspreid covid-… https://t.co/AjCBjNHZLp,2020-11-19 00:00:01+00:00,5 days 19:27:49
4,2020-11-24 19:27:34+00:00,@Wiep13396680 @dekoran1 @LidwienNews @JohannesKingma Als de sekte @D66 verdwijnt is #Corona ook verdwenen.,2020-11-19 00:00:01+00:00,5 days 19:27:33
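
The sample is selected by position (`iloc`), while the remaining rows are selected by index label (`index.isin`). The two agree here because the dataframe has the default integer index 0, 1, 2, and so on. The sketch below uses a small made-up dataframe (`toy`) to check that such a split is disjoint and covers every row.

```python
import random
import pandas as pd

# Made-up data with a default RangeIndex, as in the tweets dataframe
toy = pd.DataFrame({"text": [f"tweet {i}" for i in range(20)]})
picked = random.Random(0).sample(range(len(toy)), 5)

toy_sample = toy.iloc[picked]                     # selects by position
toy_remaining = toy.loc[~toy.index.isin(picked)]  # selects by index label

# Together the two parts contain every row exactly once
assert len(toy_sample) + len(toy_remaining) == len(toy)
assert toy_sample.index.intersection(toy_remaining.index).empty
```

If the index were not the default one (for example after filtering rows), positions and labels would no longer match and the check would fail.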


#### Determine the sentiment of the sample dataset using TextBlob
First create an extra column with the outcome of the TextBlob sentiment analyzer. This results in a tuple (polarity, subjectivity):
* Polarity is a float in the range [-1, 1], where 1 means a positive statement and -1 means a negative statement.
* Subjectivity is a float in the range [0, 1], where 0 is very objective and 1 is very subjective. Subjective sentences generally refer to personal opinion, emotion or judgment, whereas objective sentences refer to factual information.

In [14]:
# Define a function that returns the sentiment of a string.
def sentiment(text):
    """input: string variable (text) you want to know the sentiment of
    output: sentiment in a tuple (polarity, subjectivity)
    """
    blob = TextBlob(text, pos_tagger=PatternTagger(), analyzer=PatternAnalyzer())
    # Return the (polarity, subjectivity) result directly
    return blob.sentiment

# test with example sentence, outcome should be (-0.1, 0.4)
text = "De kat wil wel vis eten maar geen poot nat maken."
sentiment(text)

(-0.1, 0.4)

In [15]:
# Map the function to the text column
tweets_sample['sentiment'] = tweets_sample['text'].map(sentiment)
tweets_sample.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  tweets_sample['sentiment'] = tweets_sample['text'].map(sentiment)


Unnamed: 0,created_at,text,lockdown_widening,timedelta,sentiment
5466,2020-11-21 17:48:26+00:00,@line_forward @Zophia79307590 Zou best eens kunnen.\nDe covid-19 testkits zijn in 2018 al gedistribueerd. https://t.co/wwUZRW3g9k,2020-11-19 00:00:01+00:00,2 days 17:48:25,"(0.10250000000000002, 0.4100000000000001)"
285,2020-11-24 17:03:44+00:00,Met of zonder Coca-Cola? #AkkoordGezondeVoedingOpScholen #CocaCola #Leefstijl #mentalegezondheidsproblemen… https://t.co/q45Zd3Uzdz,2020-11-19 00:00:01+00:00,5 days 17:03:43,"(0.0, 0.0)"
12879,2020-11-18 14:39:29+00:00,Vanaf 1 december 2020 geldt een mondkapjesplicht in de publieke binnenruimtes. https://t.co/uCLpyJwPS3 #coronavirus… https://t.co/Nh2cFr6k7f,2020-11-19 00:00:01+00:00,-1 days +14:39:28,"(0.05, 0.15000000000000002)"
13127,2020-11-18 13:09:48+00:00,#vlog 19 #60plusser nadeel online netwerken + voordeel #tijd #Covid_19 https://t.co/c1BY78KThV,2020-11-19 00:00:01+00:00,-1 days +13:09:47,"(0.0, 0.0)"
3432,2020-11-23 07:06:44+00:00,"Besmettingen op VO-scholen neemt toe. Niet heel vreemd, maatregelen worden minder nageleefd, want de ventilatie is… https://t.co/Q3YcnYaWWL",2020-11-19 00:00:01+00:00,4 days 07:06:43,"(0.10416666666666666, 0.35416666666666663)"
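
The warning in the output above is pandas' `SettingWithCopyWarning`. It appears because `tweets_sample` was created by selecting rows from `tweets`, and pandas cannot tell whether the new column is meant for the subset or for the original dataframe. A common way to make the intent explicit is to take a copy when creating the subset. The sketch below shows this on a made-up dataframe; in this notebook it would correspond to adding `.copy()` after `tweets.iloc[randomlist]`, which is a suggestion and not part of the original code.

```python
import pandas as pd

toy = pd.DataFrame({"text": ["goed nieuws", "slecht nieuws", "nieuws"]})

# .copy() makes the subset an independent dataframe
toy_sample = toy.iloc[[0, 2]].copy()

# Adding a column to the copy leaves the original dataframe unchanged
toy_sample["n_chars"] = toy_sample["text"].str.len()

print(toy_sample)
print("n_chars" in toy.columns)
```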


In [16]:
# Split the sentiment column into two columns: polarity and subjectivity
tweets_sample[['polarity', 'subjectivity']] = pd.DataFrame(tweets_sample['sentiment'].tolist(), index=tweets_sample.index)
tweets_sample.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[k1] = value[k2]


Unnamed: 0,created_at,text,lockdown_widening,timedelta,sentiment,polarity,subjectivity
5466,2020-11-21 17:48:26+00:00,@line_forward @Zophia79307590 Zou best eens kunnen.\nDe covid-19 testkits zijn in 2018 al gedistribueerd. https://t.co/wwUZRW3g9k,2020-11-19 00:00:01+00:00,2 days 17:48:25,"(0.10250000000000002, 0.4100000000000001)",0.1025,0.41
285,2020-11-24 17:03:44+00:00,Met of zonder Coca-Cola? #AkkoordGezondeVoedingOpScholen #CocaCola #Leefstijl #mentalegezondheidsproblemen… https://t.co/q45Zd3Uzdz,2020-11-19 00:00:01+00:00,5 days 17:03:43,"(0.0, 0.0)",0.0,0.0
12879,2020-11-18 14:39:29+00:00,Vanaf 1 december 2020 geldt een mondkapjesplicht in de publieke binnenruimtes. https://t.co/uCLpyJwPS3 #coronavirus… https://t.co/Nh2cFr6k7f,2020-11-19 00:00:01+00:00,-1 days +14:39:28,"(0.05, 0.15000000000000002)",0.05,0.15
13127,2020-11-18 13:09:48+00:00,#vlog 19 #60plusser nadeel online netwerken + voordeel #tijd #Covid_19 https://t.co/c1BY78KThV,2020-11-19 00:00:01+00:00,-1 days +13:09:47,"(0.0, 0.0)",0.0,0.0
3432,2020-11-23 07:06:44+00:00,"Besmettingen op VO-scholen neemt toe. Niet heel vreemd, maatregelen worden minder nageleefd, want de ventilatie is… https://t.co/Q3YcnYaWWL",2020-11-19 00:00:01+00:00,4 days 07:06:43,"(0.10416666666666666, 0.35416666666666663)",0.104167,0.354167


#### Based on TextBlob, categorize sentiment as positive, negative or neutral

In [17]:
# Define a function to determine the sentiment category
def sentiment_cat(polarity):
    """input: float, based on the output of TextBlob polarity measure, you want to categorize
    output: sentiment category, either positive, negative or neutral
    """
    polarity = float(polarity)
    if polarity > 0:
        return 'positive'
    elif polarity < 0:
        return 'negative'
    elif polarity == 0:
        return 'neutral'
    else:
        # Reached only when polarity is NaN
        return 'error'
    
# test with example (-0.1 from previous example) should return negative
polarity = "-0.1"
sentiment_cat(polarity)

'negative'

In [18]:
# Map the function to the polarity column
tweets_sample['sentiment_cat'] = tweets_sample['polarity'].map(sentiment_cat)
tweets_sample.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  tweets_sample['sentiment_cat'] = tweets_sample['polarity'].map(sentiment_cat)


Unnamed: 0,created_at,text,lockdown_widening,timedelta,sentiment,polarity,subjectivity,sentiment_cat
5466,2020-11-21 17:48:26+00:00,@line_forward @Zophia79307590 Zou best eens kunnen.\nDe covid-19 testkits zijn in 2018 al gedistribueerd. https://t.co/wwUZRW3g9k,2020-11-19 00:00:01+00:00,2 days 17:48:25,"(0.10250000000000002, 0.4100000000000001)",0.1025,0.41,positive
285,2020-11-24 17:03:44+00:00,Met of zonder Coca-Cola? #AkkoordGezondeVoedingOpScholen #CocaCola #Leefstijl #mentalegezondheidsproblemen… https://t.co/q45Zd3Uzdz,2020-11-19 00:00:01+00:00,5 days 17:03:43,"(0.0, 0.0)",0.0,0.0,neutral
12879,2020-11-18 14:39:29+00:00,Vanaf 1 december 2020 geldt een mondkapjesplicht in de publieke binnenruimtes. https://t.co/uCLpyJwPS3 #coronavirus… https://t.co/Nh2cFr6k7f,2020-11-19 00:00:01+00:00,-1 days +14:39:28,"(0.05, 0.15000000000000002)",0.05,0.15,positive
13127,2020-11-18 13:09:48+00:00,#vlog 19 #60plusser nadeel online netwerken + voordeel #tijd #Covid_19 https://t.co/c1BY78KThV,2020-11-19 00:00:01+00:00,-1 days +13:09:47,"(0.0, 0.0)",0.0,0.0,neutral
3432,2020-11-23 07:06:44+00:00,"Besmettingen op VO-scholen neemt toe. Niet heel vreemd, maatregelen worden minder nageleefd, want de ventilatie is… https://t.co/Q3YcnYaWWL",2020-11-19 00:00:01+00:00,4 days 07:06:43,"(0.10416666666666666, 0.35416666666666663)",0.104167,0.354167,positive
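
Mapping a Python function over the column works well for 500 rows. For larger datasets, the same categories could be derived in a vectorized way. The sketch below uses `numpy.select` on a few made-up polarity values; the names `labels` and `polarity_cat` are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up polarity values, including the examples used earlier
polarity = pd.Series([0.1025, 0.0, -0.1, 0.05])

# The first matching condition decides the label; everything else is neutral
labels = np.select(
    [polarity > 0, polarity < 0],
    ["positive", "negative"],
    default="neutral",
)
polarity_cat = pd.Series(labels, index=polarity.index)
print(polarity_cat.value_counts())
```

Unlike `sentiment_cat`, this sketch labels missing values as 'neutral' rather than 'error', so missing polarities would need to be handled separately.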


#### Save output as Excel file. This output will be manually judged outside Python

In [20]:
# Remove unnecessary columns. Remaining columns: text and sentiment_cat
tweets_sample_cat = tweets_sample[['text', 'sentiment_cat']]

# Store column names in a list to use as input for saving as excel file
col_names = tweets_sample_cat.columns.tolist()

In [21]:
# Save dataframe as Excel file (forward slashes work on Windows, macOS and Linux)
tweets_sample_cat.to_excel("data/twitter_sample_sentiment.xlsx", sheet_name='Twitter_sentiment', columns=col_names, index=False)
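
Once the sample has been judged manually, the TextBlob categories can be compared with the manual labels. The sketch below assumes a hypothetical column `manual_label` that is added by hand in the Excel file. The four rows are invented for illustration and are not results from this project.

```python
import pandas as pd

# Invented example rows; manual_label is a hypothetical hand-labeled column
judged = pd.DataFrame({
    "sentiment_cat": ["positive", "neutral", "negative", "positive"],
    "manual_label":  ["positive", "negative", "negative", "neutral"],
})

# Share of messages where TextBlob and the manual judgment agree
accuracy = (judged["sentiment_cat"] == judged["manual_label"]).mean()

# Rows: manual label, columns: TextBlob category
confusion = pd.crosstab(judged["manual_label"], judged["sentiment_cat"])

print(f"agreement: {accuracy:.2f}")
print(confusion)
```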