# Exploratory Data and Sentiment Analysis on COVID-19 related Tweets via the TwitterAPI

### Author: George Spyrou
### Date: 26/09/2020

<img src="../img/sentiment_image.jpeg" alt="Sentiment Picture" width="600" height="400">

## Sections

- <a href='#project_idea' style="text-decoration: none">Project Idea</a>
- <a href='#data_retrieval' style="text-decoration: none">Part 1: Twitter API and Data Retrieval</a>

<a id='project_idea'></a>
## Project Idea

The purpose of this project is to leverage the TwitterAPI functionality offered by Twitter and conduct an analysis of tweets related to the SARS-CoV-2 virus, widely known as **COVID-19**. From now on, for ease of use, we will refer to the virus by the latter name, which is in reality the name of the disease.

This project started as an exploratory task to learn how the TwitterAPI can be used to retrieve data (tweets) from the web, and how to use the tweepy and searchtweets Python packages.

After I managed to retrieve the data, I became really interested in digging deeper to get a better understanding of the data and the information they contain. COVID-19 has been one of the most discussed topics on Twitter, or on any other social media platform, roughly since the virus was first discovered in December 2019. Given the nature of this topic, we would expect people to hold different opinions: some people feel scared of the virus, while others do not even believe that it exists in the first place. Hence, I thought it would be interesting to see how the **overall** sentiment of people's opinions about the virus changes across different time periods during the year.

I believe most people would agree that during the period January 2020 - March 2020, whilst the virus was not yet widespread, people would not feel afraid and the sentiment would most likely be **neutral**. On the other hand, after April 2020, when most countries ended up in lockdown and the virus became a reality for everyone, we would expect the sentiment to be more **negative**.

Before we move further, we have to make sure that we understand how we will _measure_ the sentiment and what it means for a tweet to be **positive**/**neutral**/**negative**. The idea is actually pretty simple: we look at the individual words that a tweet consists of and use them to estimate the sentiment of the tweet as a whole. As a quick example, a tweet containing words like 'death' and 'disease' is more likely to get a negative sentiment than a tweet containing words like 'cure' and 'healed'.
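To make the idea concrete, here is a minimal sketch of word-level sentiment scoring. The tiny word lists and the `score_tweet` helper are hypothetical and made up for illustration only; a real opinion lexicon contains thousands of words, and this is not the code used for the analysis in this project.

```python
import re

# Hypothetical toy lexicons, for illustration only
POSITIVE_WORDS = {"cure", "healed", "recovered", "hope"}
NEGATIVE_WORDS = {"death", "disease", "scared", "afraid"}


def score_tweet(text):
    """Label a tweet by comparing its counts of positive and negative words."""
    # Lowercase the text and keep alphabetic tokens only
    words = re.findall(r"[a-z']+", text.lower())
    n_pos = sum(word in POSITIVE_WORDS for word in words)
    n_neg = sum(word in NEGATIVE_WORDS for word in words)
    if n_pos > n_neg:
        return "positive"
    if n_neg > n_pos:
        return "negative"
    return "neutral"


print(score_tweet("So many people healed this week, a cure feels close"))
print(score_tweet("Another death from this disease in our town"))
print(score_tweet("Watching the news about the virus"))
```

The three calls print `positive`, `negative` and `neutral` respectively. A tweet with no lexicon words, or with equally many positive and negative words, falls back to neutral.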


#### Version 1: The first version was completed on 01/03/2020 and includes analysis of:
- Most common words present in tweets.
- Most common bigrams (i.e. pairs of words that often appear next to each other).
- Sentiment analysis by using the Liu Hu opinion lexicon algorithm.
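As a rough illustration of the first two items, word and bigram frequencies can be counted with the standard library alone. The three example tweets are hypothetical and already cleaned; real tweets would first need steps such as lowercasing and removing URLs, which are assumed here rather than shown.

```python
from collections import Counter

# Hypothetical, already-cleaned tweets used only for illustration
tweets = [
    "stay home stay safe",
    "please stay home this weekend",
    "stay safe everyone",
]

word_counts = Counter()
bigram_counts = Counter()

for tweet in tweets:
    tokens = tweet.split()
    word_counts.update(tokens)
    # zip pairs each token with its right-hand neighbour, giving the bigrams
    bigram_counts.update(zip(tokens, tokens[1:]))

print(word_counts.most_common(3))
print(bigram_counts.most_common(3))
```

Bigrams are built per tweet, so the last word of one tweet is never paired with the first word of the next.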
    

<a id="data_retrieval"></a>
## Part 1 - Twitter API and Data Retrieval

In the first part of this project, we set up the environment required for the analysis and retrieve the data through the TwitterAPI, leveraging the awesome _searchtweets_ package (https://pypi.org/project/searchtweets/) to connect to the API.

Below I present the script that I used to get tweets for different time intervals. At this point it's necessary to mention that I used the free tier of the TwitterAPI, which of course comes with some limitations. A major one is the amount of data that can be retrieved per month, as the free tier provides the following:
1. 25k tweets for the 30-day tier (i.e. data from at most 30 days before the moment you make the API call)
2. 5k tweets for the full archive (i.e. data from any day during the year)

Later in the project we will discuss some other limitations of the free tier. The script that I used to make the API calls is shown below. Please note that this script cannot be run inside a Jupyter notebook, as it is set up to run from the command line. Either way, I chose to present it because it is useful to see the logic of how to set up a Python script that makes the API calls.

The script is structured so that it retrieves data from different times of the day. I did this to "randomize" the tweets as much as possible, because I wanted to avoid cases where, for example, all the data came from a Monday morning, when the news and tweets would most probably be about the same topic.
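One way to split a day into such time windows is sketched here. The script delegates this step to the project helper `tcf.create_date_time_frame`, whose code is not shown in this notebook, so the function `split_day_into_windows` is a hypothetical stand-in and not the author's implementation. The `'YYYY-MM-DD HH:MM'` string format of the window boundaries is also an assumption.

```python
from datetime import datetime, timedelta


def split_day_into_windows(day, hour_sep=2):
    """Split one day into consecutive (start, end) windows of hour_sep hours.

    Hypothetical stand-in for the project's own helper; the
    'YYYY-MM-DD HH:MM' output format is an assumption.
    """
    start = datetime.strptime(day, "%Y-%m-%d")
    end_of_day = start + timedelta(days=1)
    step = timedelta(hours=hour_sep)
    fmt = "%Y-%m-%d %H:%M"

    windows = []
    while start < end_of_day:
        # Never let the last window run past midnight of the next day
        end = min(start + step, end_of_day)
        windows.append((start.strftime(fmt), end.strftime(fmt)))
        start = end
    return windows


windows = split_day_into_windows("2020-03-01", hour_sep=2)
print(len(windows))   # 12 windows of two hours each
print(windows[0])     # ('2020-03-01 00:00', '2020-03-01 02:00')
print(windows[-1])    # ('2020-03-01 22:00', '2020-03-02 00:00')
```

Requesting a capped number of tweets from each window, rather than one large batch per day, is what spreads the sample across the whole day.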

In [None]:
import os
import argparse
import json
from datetime import datetime, timedelta

# Twitter API
from searchtweets import load_credentials
from searchtweets import gen_rule_payload
from searchtweets import ResultStream


# Secure location of the required keys to connect to the API
# This config also contains the search query
json_loc = r'D:\GitHub\Projects\Twitter_Project\Twitter_Topic_Modelling\twitter_config.json'

with open(json_loc) as json_file:
    config = json.load(json_file)

# Project folder location and keys
os.chdir(config["project_directory"])

# Custom functions created for the project
import twitter_custom_functions as tcf

keys_yaml_location = config["keys"]

# Load the credentials to get access to the API
premium_search_args = load_credentials(filename=keys_yaml_location,
                                       yaml_key="search_tweets_api_fullarchive",
                                       env_overwrite=False)
print(premium_search_args)

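# Set tweet extraction period (dates are expected in YYYY-MM-DD format)
parser = argparse.ArgumentParser()
parser.add_argument('fromDate', type=str)
parser.add_argument('toDate', type=str)
args = parser.parse_args()

# Zero-padded YYYY-MM-DD strings sort chronologically, so a plain
# string comparison is enough to validate the range
if args.toDate <= args.fromDate:
    # parser.error prints the message and exits with a non-zero status
    parser.error('The date range given is invalid. Please give correct from/to dates')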

daysList = [args.fromDate]

print(f'Collecting Tweets from: {args.fromDate} to {args.toDate}')

while args.fromDate != args.toDate:
    date = datetime.strptime(args.fromDate, "%Y-%m-%d")
    mod_date = date + timedelta(days=1)
    incrementedDay = datetime.strftime(mod_date, "%Y-%m-%d")
    daysList.append(incrementedDay)
    
    args.fromDate = incrementedDay
    
# Retrieve the data for each day from the API
for day in daysList:
    
    dayNhourList = tcf.create_date_time_frame(day, hourSep=2)
    
    for hs in dayNhourList:
        fromDate = hs[0]
        toDate = hs[1]
        # Create the searching rule for the stream
        rule = gen_rule_payload(pt_rule=config['search_query'],
                                from_date=fromDate,
                                to_date=toDate,
                                results_per_call=100)

        # Set up the stream
        rs = ResultStream(rule_payload=rule, max_results=100,
                          **premium_search_args)

        # Create a .jsonl with the results of the Stream query
        #file_date = datetime.now().strftime('%Y_%m_%d_%H_%M')
        file_date = '_'.join(hs).replace(' ', '').replace(':','')
        filename = os.path.join(config["outputFiles"],
                                f'twitter_30day_results_{file_date}.jsonl')
    
        # Write the data received from the API to a file
        with open(filename, 'a', encoding='utf-8') as f:
            cntr = 0
            for tweet in rs.stream():
                cntr += 1
                if cntr % 100 == 0:
                    n_str, cr_date = str(cntr), tweet['created_at']
                    print(f'\n {n_str}: {cr_date}')
                json.dump(tweet, f)
                f.write('\n')
        print(f'Created file: {filename}')

The script above had to be run *many* times, and each run produced multiple jsonl files containing the data retrieved for the specified timeframe. Note that the timeframe was usually 2 or 3 days per run of the script. The reason is that there is a **limit** on the number of tweets you can receive in one API call, as well as on the number of calls you can make in an hour.
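To get a feel for how quickly the monthly caps are used up, the arithmetic can be sketched with the numbers quoted in this notebook: at most 100 tweets per time window, one window every 2 hours, and free-tier caps of 25k and 5k tweets. The sketch assumes, as a worst case, that every window returns the full 100 tweets; how Twitter actually counts usage against the caps is not covered here.

```python
# Figures quoted in this notebook
RESULTS_PER_WINDOW = 100          # max_results in the script
HOURS_PER_WINDOW = 2              # hourSep=2 in the script
MONTHLY_CAP_30DAY = 25_000        # free tier, 30-day endpoint
MONTHLY_CAP_FULLARCHIVE = 5_000   # free tier, full-archive endpoint

windows_per_day = 24 // HOURS_PER_WINDOW
max_tweets_per_day = windows_per_day * RESULTS_PER_WINDOW


def days_covered(monthly_cap):
    """Whole days of collection that fit under a monthly tweet cap."""
    return monthly_cap // max_tweets_per_day


print(max_tweets_per_day)                      # 1200
print(days_covered(MONTHLY_CAP_30DAY))         # 20
print(days_covered(MONTHLY_CAP_FULLARCHIVE))   # 4
```

Under these assumptions the full-archive cap covers only about four full days per month, so the days to sample have to be chosen carefully.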

As you can imagine, the process above generated hundreds of json files, each one corresponding to a _specific_ time period of a _specific_ day. To make our life easier, we created a script that merges the raw json files into a single .txt file, which gets updated every time we receive new data and re-run the script. I am not going to present the code that does this job here, but if you are interested you can find it <a href="https://github.com/gpsyrou/Twitter_Topic_Modelling/blob/master/utilities/merge_json_files.py" alt="link_merge_json" style="text-decoration: none" >here</a>.
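For readers who only want the general idea, a merge of this kind might look like the sketch below. It is not the linked script: the function name `merge_jsonl_files`, the de-duplication on the tweet `id` field, and the choice to rebuild the output file from scratch on every run are all assumptions made for illustration.

```python
import glob
import json
import os


def merge_jsonl_files(input_dir, output_path):
    """Write every tweet found in the .jsonl files of input_dir to one file.

    Tweets whose 'id' was already seen are skipped. The output file is
    rebuilt from scratch on each call and should not itself be a .jsonl
    file inside input_dir. Returns the number of tweets written.
    """
    seen_ids = set()
    written = 0
    with open(output_path, "w", encoding="utf-8") as out:
        # Sorting makes the merge order reproducible between runs
        for path in sorted(glob.glob(os.path.join(input_dir, "*.jsonl"))):
            with open(path, encoding="utf-8") as f:
                for line in f:
                    line = line.strip()
                    if not line:
                        continue  # ignore blank lines
                    tweet = json.loads(line)
                    tweet_id = tweet.get("id")
                    if tweet_id is not None:
                        if tweet_id in seen_ids:
                            continue  # duplicate from an overlapping window
                        seen_ids.add(tweet_id)
                    out.write(json.dumps(tweet) + "\n")
                    written += 1
    return written
```

A call such as `merge_jsonl_files("raw_tweets", "all_tweets.txt")` (both paths hypothetical) would then produce a single file with one tweet per line.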