# Exploratory Data and Sentiment Analysis on COVID-19 related tweets via TwitterAPI

### Author: George Spyrou
### Date: 01/03/2020

The purpose of this project is to leverage the TwitterAPI functionality offered by Twitter and analyse tweets that are related to the COVID-19 virus. The project started as an exploratory task to learn how the TwitterAPI can be used to retrieve data (tweets) from the web, and how to use the tweepy and searchtweets Python packages.

After I managed to retrieve the data, I became really interested in performing some data analysis on the retrieved tweets. COVID-19, commonly known as coronavirus, was one of the most discussed topics on Twitter for the period 01/01/2020 - 01/03/2020. Using relevant tweets, we want to perform some exploratory data analysis (e.g. find the most common words used in tweets related to COVID-19, or identify the bigrams) and then attempt to identify the sentiment of the tweets by using a variety of methods.

#### Version 1: The first version was completed on 01/03/2020 and includes analysis of:
- Most common words present in tweets.
- Most common bigrams (i.e. pairs of words that often appear next to each other).
- Sentiment analysis by using the Liu Hu opinion lexicon algorithm.
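
The three analyses listed above can be sketched with the standard library alone. This is a minimal illustration, not the code used in this project: the function names are hypothetical, no stopword filtering is applied, and the two word sets are toy stand-ins rather than the actual Liu Hu opinion lexicon. The sentiment function only mirrors the general idea of a lexicon method, which is to compare the counts of positive and negative words.

```python
from collections import Counter
import re

# Toy word lists for illustration only; NOT the Liu Hu opinion lexicon.
POSITIVE_WORDS = {"good", "safe", "recover"}
NEGATIVE_WORDS = {"bad", "sick", "panic"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def most_common_words(texts, n=3):
    """Count word frequencies across all texts and return the top n."""
    counts = Counter(word for text in texts for word in tokenize(text))
    return counts.most_common(n)


def most_common_bigrams(texts, n=3):
    """Count pairs of adjacent words within each text and return the top n."""
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        # zip pairs every token with the token that follows it
        counts.update(zip(tokens, tokens[1:]))
    return counts.most_common(n)


def lexicon_sentiment(text):
    """Label a text by comparing its counts of positive and negative words."""
    tokens = tokenize(text)
    pos = sum(token in POSITIVE_WORDS for token in tokens)
    neg = sum(token in NEGATIVE_WORDS for token in tokens)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"
```

Each function takes plain strings, so a list of tweet texts can be passed directly to `most_common_words` and `most_common_bigrams`, while `lexicon_sentiment` is applied to one tweet at a time.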
    
In the first part of the project, we set up the environment required for our analysis and retrieve the data by using the TwitterAPI.

In [5]:
# Import dependencies
import os
import json
import pandas as pd
from datetime import datetime, timedelta

from collections import Counter

# Plots and graphs
import matplotlib.pyplot as plt
from wordcloud import WordCloud
import seaborn as sns

# NLTK module for text preprocessing and analysis
from nltk import word_tokenize
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures

from nltk.corpus import stopwords

from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

from plotly.offline import plot

# Set up the project environment

# Secure location of the required keys to connect to the API
# This config also contains the search query (in this case 'coronavirus')
json_loc = 'C:\\Users\\george\\Desktop\\Twitter_Project\\Twitter\\twitter_config.json'


with open(json_loc) as json_file:
    configFile = json.load(json_file)

# Project folder location and keys
os.chdir(configFile["project_directory"])

For this project we created a variety of functions: some are used to retrieve and clean the data, and others are used for our main analysis and plotting. For more information regarding these functions, please refer to the **twitterCustomFunc.py** file.

In [None]:
# Import the custom functions that we will use to retrieve and analyse the data

import twitterCustomFunc as twf

from searchtweets import load_credentials
from searchtweets import gen_rule_payload
from searchtweets import ResultStream

twitter_keys_loc = configFile["keys"]

# Load the credentials to get access to the API
premium_search_args = load_credentials(twitter_keys_loc,
                                       yaml_key="search_tweets_api",
                                       env_overwrite=False)
print(premium_search_args)


# Set tweet extraction period and create a list of days of interest
fromDate = "2020-03-16"
toDate = "2020-03-18"

daysList = [fromDate]

while fromDate != toDate:
    date = datetime.strptime(fromDate, "%Y-%m-%d")
    # Advance one day at a time so that no day in the range is skipped
    mod_date = date + timedelta(days=1)
    incrementedDay = datetime.strftime(mod_date, "%Y-%m-%d")
    daysList.append(incrementedDay)
    
    fromDate = incrementedDay

# Retrieve the data for each day from the API
for day in daysList:
    
    dayNhourList = twf.createDateTimeFrame(day, hourSep=2)
    
    for hs in dayNhourList:
        fromDate = hs[0]
        toDate = hs[1]
        # Create the searching rule for the stream
        rule = gen_rule_payload(pt_rule=configFile['search_query'],
                                from_date=fromDate,
                                to_date=toDate,
                                results_per_call=100)

        # Set up the stream
        rs = ResultStream(rule_payload=rule,
                            max_results=100,
                            **premium_search_args)

        # Create a .jsonl with the results of the Stream query
        #file_date = datetime.now().strftime('%Y_%m_%d_%H_%M')
        file_date = '_'.join(hs).replace(' ', '').replace(':','')
        filename = os.path.join(configFile["outputFiles"],
                                f'twitter_30day_results_{file_date}.jsonl')
    
        # Write the data received from the API to a file
        with open(filename, 'a', encoding='utf-8') as f:
            cntr = 0
            for tweet in rs.stream():
                cntr += 1
                # Print a progress message every 100 tweets
                if cntr % 100 == 0:
                    n_str, cr_date = str(cntr), tweet['created_at']
                    print(f'\n {n_str}: {cr_date}')
                # Save every tweet, one JSON object per line
                json.dump(tweet, f)
                f.write('\n')
        print(f'Created file {filename}')

At this stage we have used the TwitterAPI to retrieve tweets relevant to coronavirus for a specific period of time. As the free version of the TwitterAPI limits how much data we can retrieve, we had to find a workaround to collect the dataset: we made multiple calls to the API, each time targeting a different day and time.

Specifically, in order to make our data collection less biased towards a specific day (for example, to avoid multiple tweets referring to the same coronavirus news), we made four calls for each day in our analysis, each call targeting a specific time of the day.
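
The idea of splitting each day into several time windows can be sketched as follows. The helper `daily_windows` is hypothetical and is not the `twf.createDateTimeFrame` function used in this project; the six-hour default (four windows per day) and the `YYYY-MM-DD HH:MM` string format are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def daily_windows(from_date, to_date, hours_per_window=6):
    """Split an inclusive date range into consecutive (start, end) windows.

    Dates are 'YYYY-MM-DD' strings; windows are returned as pairs of
    'YYYY-MM-DD HH:MM' strings, each hours_per_window hours long.
    """
    if hours_per_window <= 0:
        raise ValueError("hours_per_window must be positive")

    fmt_day = "%Y-%m-%d"
    fmt_window = "%Y-%m-%d %H:%M"
    current = datetime.strptime(from_date, fmt_day)
    # Stop at midnight after the final day so that day is fully covered
    stop = datetime.strptime(to_date, fmt_day) + timedelta(days=1)
    step = timedelta(hours=hours_per_window)

    windows = []
    # Comparing with '<' guarantees termination for any valid range
    while current < stop:
        window_end = min(current + step, stop)
        windows.append((current.strftime(fmt_window),
                        window_end.strftime(fmt_window)))
        current = window_end
    return windows
```

Each returned pair could then serve as the start and end of one API call, so that the tweets collected for a day are spread across its morning, afternoon, evening, and night.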


In [9]:
# Path which contains the created jsonl files
jsonl_files_folder = os.path.join(configFile["project_directory"], configFile["outputFiles"])

In [10]:
def loadJsonlData(file: str) -> list:
    '''
    Reads the data as saved in a .jsonl file
    
    Args:
    ----
    file: String corresponding to the path to a .jsonl file which contains the 
          tweets as received from the TwitterAPI.

    Returns:
    -------
    tweets: A list of all the data saved in the .jsonl file.
    '''
    
    tweets = []
    with open(file, 'rb') as f:
        # broken=True makes json_lines skip lines that cannot be parsed
        for tweet in json_lines.reader(f, broken=True):
            tweets.append(tweet)

    return tweets
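
For readers who do not have the json_lines package installed, the same tolerant reading can be approximated with the standard library. This is a sketch under that assumption, not the loader used in this project, and `load_jsonl_stdlib` is a hypothetical name.

```python
import json


def load_jsonl_stdlib(path):
    """Read a .jsonl file, skipping blank or malformed lines."""
    records = []
    # errors='replace' keeps reading past bytes that are not valid UTF-8
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # ignore empty lines
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError:
                continue  # skip lines that are not valid JSON
    return records
```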

In [12]:
import json_lines

# List that will contain all the tweets that we managed to receive via the API
allTweetsList = []

for file in os.listdir(jsonl_files_folder):
    if 'twitter' in file:
        tweets_full_list = loadJsonlData(os.path.join(jsonl_files_folder, file))
        allTweetsList += tweets_full_list

At this point we have collected all the tweets and created a merged list that contains all the relevant information. Each tweet comes with a variety of information, such as the name/id of the person who posted it, their location, the time of the tweet, and much more. We can have a look inside the first tweet and decide which information seems relevant for our analysis.

In [None]:
allTweetsList[0]

As we can see in the example above, there is a wide variety of information that we can focus on. For our analysis, we will proceed with the user's **screen name**, **id**, **location**, the **day/time** of the tweet, the **tweet** itself, and the id of the user that they are replying to (if any).
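
As a rough sketch of how these fields can be pulled out of a single tweet dictionary, the hypothetical helper below uses `.get` so that a missing key yields `None` instead of raising a `KeyError`. The key names and column names are the ones used elsewhere in this notebook; the helper itself is an illustration, not part of the project code.

```python
def extract_fields(tweet):
    """Flatten one tweet dictionary into the fields used for the analysis."""
    # 'user' may be missing or None, so fall back to an empty dictionary
    user = tweet.get("user") or {}
    return {
        "Username": user.get("screen_name"),
        "UserID": user.get("id"),
        "Tweet": tweet.get("text"),
        "Reply_to": tweet.get("in_reply_to_user_id"),
        "Location": user.get("location"),
        "Date": tweet.get("created_at"),
        "Coordinates": tweet.get("geo"),
    }
```

Applying this function to every element of a list of tweets gives a list of flat records that can be passed directly to `pd.DataFrame`.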

### Data preprocessing and cleaning

In [14]:
import re

def removeURL(text: str) -> str:
    '''
    Removes URLs (strings that start with 'http://' or 'https://') from text.
    
    Args:
    -----
    text: Input string that we want to remove the URLs from.
     
    Returns:
    -------
    text: The input string with any URLs removed.
    '''

    regex = r'http[0-9a-zA-Z\\/.:]+.'
    for url in re.findall(regex, text):
        print(f'String removed: {url}')
        text = text.replace(url, '')
    # Always return the text, even when it did not contain any URL
    return text

At this point we will create a dataframe based on the relevant data from the full list of received tweets. There is plenty of information in the data received, but we will focus on only a few attributes (features) to create our dataframe.

In [None]:
user_ls, userid_ls, tweet_ls = [], [], []
location_ls, datetime_ls, replyto_ls = [], [], []
geo_loc_ls = []

for tweet_dict in allTweetsList:
    user_ls.append(tweet_dict['user']['screen_name'])
    userid_ls.append(tweet_dict['user']['id'])
    tweet_ls.append(twf.removeURL(tweet_dict['text']))
    replyto_ls.append(tweet_dict['in_reply_to_user_id'])
    location_ls.append(tweet_dict['user']['location'])
    datetime_ls.append(tweet_dict['created_at'])
    geo_loc_ls.append(tweet_dict['geo'])

In [16]:
# Dataframe that contains the data for analysis
# Note: The Twitter API offers a very broad range of data that we can analyse.
# This project will focus on tweets and their respective location/date.
df = pd.DataFrame(list(zip(user_ls, userid_ls, tweet_ls,
                           replyto_ls, location_ls, datetime_ls, geo_loc_ls)),
                  columns=['Username', 'UserID', 'Tweet',
                           'Reply_to', 'Location', 'Date', 'Coordinates'])

Finally, we remove any rows that do not contain tweets (empty text), as they are not relevant for the scope of this project. Unfortunately, the free version of the TwitterAPI does not provide the option to filter out tweets that do not contain text. This option is available only for paid subscriptions.

In [17]:
# Remove tweets that did not have any text
df = df[df['Tweet'].notnull()].reset_index(drop=True)

Some tweets are not written in English, so we are going to translate them by using the Google Translate API.

In [18]:
# Translator is assumed to come from the googletrans package
from googletrans import Translator

def translateTweet(text: str) -> str:
    '''
    If Tweets are written in any other language than English, translate to
    English and return the translated string.
    '''
    translator = Translator(service_urls=['translate.google.com'])
    try:
        textTranslated = translator.translate(text, dest='en').text
    except json.JSONDecodeError:
        # Fall back to the original text if the response cannot be parsed
        textTranslated = text
    return textTranslated

In [19]:
# Detect language and translate if necessary
df['Tweet'] = df['Tweet'].apply(translateTweet)

In [20]:
df.to_csv('tweetsdata20200321.csv', sep='\t', encoding='utf-8')

In [23]:
df = pd.read_csv('tweetsdata20200321.csv', sep='\t', encoding = 'utf-8', index_col=0)

One good idea would be to plot the tweets on a geographical map. In order to do that we need the coordinates (longitude/latitude) of the tweets, which are not available for most of the data extracted. That said, most of the tweets have a 'Location' tag, which we can geocode to derive the longitude/latitude from the location string. For example, if a tweet has 'London' as its location string, we will transform this string into the corresponding longitude/latitude pair.

For this purpose we will use the functionality that the geopy package has to offer, as below:

In [28]:
# The geocoding service limits the number of calls per second, so we rate-limit our requests

geolocator = Nominatim(user_agent="https://developer.twitter.com/en/apps/17403833") 
   
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1,
                      max_retries=3, error_wait_seconds=2)
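
Because every lookup is rate-limited, it can also help to geocode each distinct location string only once. The wrapper below is a hypothetical sketch of that idea and is not part of the project code: it accepts any geocoding callable (for example the `geocode` object defined above) and assumes that location strings differing only in case or surrounding whitespace refer to the same place.

```python
def make_cached_geocoder(geocode_fn):
    """Wrap a geocoding callable so each distinct location is looked up once."""
    cache = {}

    def cached(location):
        # Missing or blank locations (None, NaN, '') are treated as unresolvable
        if not isinstance(location, str) or not location.strip():
            return None
        # Normalise the key so 'London' and ' london ' share one lookup
        key = location.strip().lower()
        if key not in cache:
            cache[key] = geocode_fn(location)
        return cache[key]

    return cached
```

A call such as `cached_geocode = make_cached_geocoder(geocode)` would then return a function that can be applied to the 'Location' column, contacting the service only for locations it has not seen before.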

In [None]:
# Split it in batches and identify the locations
step = 100
point_batches = []

for batch in range(0, df.shape[0], step):
    batchstep = min(batch + step, df.shape[0])
    print(f'\nCalculating batch: {batch}-{batchstep}\n')
    point_batches.append(df['Location'].iloc[batch:batchstep].apply(
        lambda loc: twf.getValidCoordinates(loc, geolocator)))

# Combine all batches so that earlier results are not overwritten
df['Point'] = pd.concat(point_batches)

In [None]:
# Work on a copy so that adding columns does not modify a slice of df
dfWithCoords = df[df['Point'].notnull()].copy()
dfWithCoords['Latitude'] = dfWithCoords['Point'].apply(lambda x: x[0])
dfWithCoords['Longitude'] = dfWithCoords['Point'].apply(lambda x: x[1])

Finally, we can create a function that will use the Longitude and Latitude values retrieved above, and plot them as a map.

In [31]:
from plotly import graph_objs as go

def createTweetWorldMap(df: pd.DataFrame):
    '''
    Given a dataframe that contains columns corresponding to Longitude and
    Latitude, create a world map plot and mark the Tweet locations on the map.
    '''
    df['Text'] = df['Date'] + ': \n' + df['Tweet']
    
    fig = go.Figure(data=go.Scattergeo(lon = df['Longitude'],
                                       lat = df['Latitude'],
                                       text = df['Text'],
                                       mode = 'markers',
                                       marker = dict(
                                           symbol = 'circle',
                                           line = dict(
                                               width=1,
                                               color='rgb(102, 102, 102)'
                                           ),
                                           colorscale = 'Viridis',
                                           cmin = 0,
            )))
    
    fig.update_layout(title = 'COVID-19 related Tweets across the world (January 2020 - March 2020) ',
                      geo_scope='world',
                      geo = dict(
                          resolution = 110,
                          scope = 'world',
                          showland = True,
                          landcolor = "rgb(250, 250, 250)",
                          subunitcolor = "rgb(217, 217, 217)",
                          countrycolor = "rgb(217, 217, 217)",
                          countrywidth = 0.6,
                          subunitwidth = 0.6,
                      ))
    return fig

In [None]:
fig = createTweetWorldMap(dfWithCoords)
plot(fig)