# Exploratory Data and Sentiment Analysis on COVID-19 related tweets through TwitterAPI

### Author: George Spyrou
### Date: 01/03/2020

The purpose of this project is to use the TwitterAPI functionality offered by Twitter to analyse tweets related to the COVID-19 virus. The project started as an exploratory task to learn how the TwitterAPI can be used to retrieve data (tweets) from the web, and how to use the tweepy and searchtweets Python packages.

After I managed to retrieve the data, I became interested in analysing the retrieved tweets. COVID-19, commonly known as coronavirus, is one of the most discussed topics on Twitter for the period 01/01/2020 - 01/03/2020. Using relevant tweets, we want to perform some exploratory data analysis (e.g. find the most common words used in tweets related to COVID-19, or identify the bigrams) and then attempt to identify the sentiment of the tweets using a variety of methods.

#### Version 1: The first version has been completed on 01/03/2020 and it includes analysis on:
    - Most common words present in tweets.
    - Most common bigrams (i.e. pairs of words that often appear next to each other).
    - Sentiment analysis by using the Liu Hu opinion lexicon algorithm.
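
As a rough illustration of the first two items in the list above, the sketch below counts words and bigrams with the standard library only. The `tokenize` and `top_bigrams` helpers and the sample sentences are made up for this example; they are not the functions or data used in the project.

```python
import re
from collections import Counter


def tokenize(text: str) -> list:
    """Lowercase the text and split it into simple alphanumeric tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def top_bigrams(texts: list, n: int) -> list:
    """Return the n most frequent pairs of adjacent tokens across all texts."""
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        # zip pairs each token with its right-hand neighbour
        counts.update(zip(tokens, tokens[1:]))
    return counts.most_common(n)


# Made-up sentences, used only to exercise the helpers
sample_texts = [
    "wash your hands",
    "please wash your hands often",
    "wash your face",
]

word_counts = Counter(tok for text in sample_texts for tok in tokenize(text))
bigrams = top_bigrams(sample_texts, 2)
print(bigrams)
```

On real tweets one would normally remove stop words before counting, otherwise pairs of very common words tend to dominate the result.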
    
In the first part of the project, we set up the environment required for our analysis and retrieve the data using the TwitterAPI.

In [22]:
# Import dependencies
import os
import re
import json
import json_lines
import pandas as pd

# Plots and graphs
import matplotlib.pyplot as plt
import seaborn as sns

# Set up the project environment

# Secure location of the required keys to connect to the API
# This config also contains the search query (in this case 'coronavirus')
json_loc = '/Users/georgiosspyrou/Desktop/config_tweets/Twitter/twitter_config.json'

with open(json_loc) as json_file:
    data = json.load(json_file)

# Project folder location and keys
os.chdir(data["project_directory"])
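
The cells in this notebook read four keys from the configuration file: `project_directory`, `keys`, `search_query` and `outputFiles`. The sketch below shows one possible layout for such a file and a small loader that fails early if a key is missing. The values are placeholders and the `load_config` helper is hypothetical; the real file may contain other entries.

```python
import json
import os
import tempfile

# Keys read by the notebook cells; the values are placeholders, not real paths
EXAMPLE_CONFIG = {
    "project_directory": "/path/to/project",
    "keys": "/path/to/twitter_keys.yaml",
    "search_query": "coronavirus",
    "outputFiles": "output",
}
REQUIRED_KEYS = tuple(EXAMPLE_CONFIG)


def load_config(path: str) -> dict:
    """Load a JSON config and check that every required key is present."""
    with open(path, encoding="utf-8") as f:
        config = json.load(f)
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"Missing config keys: {missing}")
    return config


# Write the example config to a temporary folder and read it back
with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = os.path.join(tmp_dir, "twitter_config.json")
    with open(config_path, "w", encoding="utf-8") as f:
        json.dump(EXAMPLE_CONFIG, f, indent=2)
    config = load_config(config_path)

print(config["search_query"])
```

Keeping the API keys in a separate file outside the project folder, as done here, avoids committing credentials together with the code.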

For this project we created a variety of functions: some retrieve and clean the data, while others are used for the main analysis and plotting. For more information regarding these functions, please refer to the **twitterCustomFunc.py** file.

In [None]:
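# Import the libraries and custom functions that we will use to retrieve and analyse the data
from datetime import datetime, timedelta

from searchtweets import load_credentials, gen_rule_payload, ResultStream

import twitterCustomFunc as twf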

twitter_keys_loc = data["keys"]

# Load the credentials to get access to the API
premium_search_args = load_credentials(twitter_keys_loc,
                                       yaml_key="search_tweets_api",
                                       env_overwrite=False)
print(premium_search_args)

# Set tweet extraction period and create a list of days of interest
fromDate = "2020-02-21"
toDate = "2020-02-25"

daysList = [fromDate]

while fromDate != toDate:
    date = datetime.strptime(fromDate, "%Y-%m-%d")
    mod_date = date + timedelta(days=1)
    incrementedDay = datetime.strftime(mod_date, "%Y-%m-%d")
    daysList.append(incrementedDay)
    
    fromDate = incrementedDay

# Retrieve the data for each day from the API
for day in daysList:
    
    dayNhourList = twf.createDateTimeFrame(day, hourSep=2)
    
    for hs in dayNhourList:
        fromDate = hs[0]
        toDate = hs[1]
        # Create the searching rule for the stream
        rule = gen_rule_payload(pt_rule=data['search_query'],
                                from_date=fromDate,
                                to_date=toDate,
                                results_per_call=100)

        # Set up the stream
        rs = ResultStream(rule_payload=rule,
                          max_results=100,
                          **premium_search_args)

        # Create a .jsonl with the results of the stream query
        file_date = '_'.join(hs).replace(' ', '').replace(':','')
        filename = os.path.join(data["outputFiles"],f'twitter_30day_results_{file_date}.jsonl')
    
        # Write every tweet received from the API to the file,
        # printing a progress line after each 100 tweets
        with open(filename, 'a', encoding='utf-8') as f:
            cntr = 0
            for tweet in rs.stream():
                cntr += 1
                if cntr % 100 == 0:
                    n_str, cr_date = str(cntr), tweet['created_at']
                    print(f'\n {n_str}: {cr_date}')
                json.dump(tweet, f)
                f.write('\n')
        print(f'Created file {filename}')

At this stage we have used the TwitterAPI to retrieve tweets relevant to coronavirus for a specific period of time. Because the free version of the TwitterAPI limits how much data we can retrieve, we needed a workaround to collect the dataset: we made multiple calls to the API, each time targeting a different day and time.
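
The list of days to query can be built in several ways. The sketch below is one alternative that works with date differences instead of comparing strings in a loop, so it cannot run forever when the end date precedes the start date. The `list_days` helper is hypothetical and is not part of the project code.

```python
from datetime import date, timedelta


def list_days(from_date: str, to_date: str) -> list:
    """Return every day from from_date to to_date (inclusive) as 'YYYY-MM-DD'."""
    start = date.fromisoformat(from_date)
    end = date.fromisoformat(to_date)
    if end < start:
        raise ValueError("to_date must not be earlier than from_date")
    n_days = (end - start).days
    return [(start + timedelta(days=i)).isoformat() for i in range(n_days + 1)]


# Same period as the one used in the retrieval cell
days = list_days("2020-02-21", "2020-02-25")
print(days)
```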

Specifically, to make our data collection less biased towards a specific time of day (for example, to avoid collecting many tweets referring to the same piece of coronavirus news), we made four calls for each day in our analysis, each call targeting a specific time of the day.
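
To make the idea of several calls per day concrete, the sketch below splits one day into evenly spaced time windows. It is only an illustration: `make_time_windows`, its parameters and the `YYYY-MM-DD HH:MM` output format are assumptions, and the actual `createDateTimeFrame` function in **twitterCustomFunc.py** may behave differently.

```python
from datetime import datetime, timedelta

TIME_FORMAT = "%Y-%m-%d %H:%M"


def make_time_windows(day: str, n_windows: int = 4, window_hours: int = 2) -> list:
    """Return n_windows (start, end) string pairs spread evenly over one day."""
    if n_windows < 1 or window_hours < 1 or n_windows * window_hours > 24:
        raise ValueError("windows must be positive and fit within 24 hours")
    midnight = datetime.strptime(day, "%Y-%m-%d")
    # Distance between the start of one window and the start of the next
    step = timedelta(hours=24 / n_windows)
    windows = []
    for i in range(n_windows):
        start = midnight + i * step
        end = start + timedelta(hours=window_hours)
        windows.append((start.strftime(TIME_FORMAT), end.strftime(TIME_FORMAT)))
    return windows


windows = make_time_windows("2020-02-21")
for start, end in windows:
    print(start, "->", end)
```

With the default arguments this yields four two-hour windows starting at 00:00, 06:00, 12:00 and 18:00, so each API call samples a different part of the day.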


In [14]:
# Path which contains the created jsonl files
jsonl_files_folder = os.path.join(data["project_directory"], data["outputFiles"])

In [15]:
def loadJsonlData(file: str) -> list:
    '''
    Reads the data as saved in a .jsonl file
    
    Args:
    ----
    file: String corresponding to the path to a .jsonl file which contains the 
          tweets as received from the TwitterAPI.

    Returns:
    -------
    tweets: A list of all the data saved in the .jsonl file.
    '''
    
    tweets = []
    with open(file, 'rb') as f:
        # broken=True makes the reader skip lines that cannot be parsed
        for tweet in json_lines.reader(f, broken=True):
            tweets.append(tweet)

    return tweets

In [16]:
# List that will contain all the Tweets that we managed to receive via the use of the API
allTweetsList = []

for file in os.listdir(jsonl_files_folder):
    if 'twitter' in file:
        tweets_full_list = loadJsonlData(os.path.join(jsonl_files_folder, file))
        allTweetsList += tweets_full_list

At this point we have collected all the tweets and merged them into a single list. Each tweet carries a variety of information, such as the name/id of the user who posted it, their location, the time of posting, and more. We can look inside one of the entries and decide which information seems relevant for our analysis.
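
The loading and merging step can also be sketched with the standard library alone, which may help when the json_lines package is not available. The `load_jsonl` and `merge_jsonl_folder` helpers and the demo files are hypothetical; they only mimic the behaviour described above, skipping lines that are not valid JSON.

```python
import json
import os
import tempfile


def load_jsonl(path: str) -> list:
    """Read one .jsonl file, skipping empty or malformed lines."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError:
                # A truncated or corrupted line is ignored
                continue
    return records


def merge_jsonl_folder(folder: str, marker: str = "twitter") -> list:
    """Merge the records of every file in folder whose name contains marker."""
    merged = []
    for name in sorted(os.listdir(folder)):
        if marker in name:
            merged.extend(load_jsonl(os.path.join(folder, name)))
    return merged


# Demo with two small made-up files, one of which has a truncated line
with tempfile.TemporaryDirectory() as tmp_dir:
    with open(os.path.join(tmp_dir, "twitter_a.jsonl"), "w", encoding="utf-8") as f:
        f.write('{"id": 1, "text": "first"}\n')
        f.write('{"id": 2, "text": "trunc\n')
    with open(os.path.join(tmp_dir, "twitter_b.jsonl"), "w", encoding="utf-8") as f:
        f.write('{"id": 3, "text": "third"}\n')
    merged = merge_jsonl_folder(tmp_dir)

print([tweet["id"] for tweet in merged])
```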

In [17]:
allTweetsList[1]

{'created_at': 'Mon Jan 27 15:59:59 +0000 2020',
 'id': 1221825187718385664,
 'id_str': '1221825187718385664',
 'text': 'RT @elmnzhri: #Coronavirus spreading in our country\n\nThe government: https://t.co/tR5kpqENgk',
 'source': '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
 'truncated': False,
 'in_reply_to_status_id': None,
 'in_reply_to_status_id_str': None,
 'in_reply_to_user_id': None,
 'in_reply_to_user_id_str': None,
 'in_reply_to_screen_name': None,
 'user': {'id': 303054216,
  'id_str': '303054216',
  'name': 'Munirah',
  'screen_name': 'SyuhadaWhoaa',
  'location': 'Kuala Lumpur',
  'url': 'http://Instagram.com/Syuhadawhoaa',
  'description': 'KL',
  'translator_type': 'none',
  'protected': False,
  'verified': False,
  'followers_count': 811,
  'friends_count': 452,
  'listed_count': 4,
  'favourites_count': 1211,
  'statuses_count': 63261,
  'created_at': 'Sun May 22 07:02:07 +0000 2011',
  'utc_offset': None,
  'time_zone': None,
  

As we can see in the example above, there is a wide variety of information that we can focus on. For our analysis, we will proceed with the user's **screen name**, **id**, **location**, the **day/time** of the tweet, the **tweet** text itself, and the id of the user that they are replying to (if any).
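
A minimal sketch of pulling these fields out of a single tweet dictionary is shown below. The `extract_fields` helper is hypothetical; it uses `.get` so that a missing field yields `None` instead of raising an error, and it parses `created_at` with a format string that matches the timestamp shown in the example above. The sample values are copied from that example.

```python
from datetime import datetime

# Matches timestamps such as 'Mon Jan 27 15:59:59 +0000 2020'
# (day and month names are parsed according to the current locale)
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def extract_fields(tweet: dict) -> dict:
    """Keep only the fields of interest from one tweet dictionary."""
    user = tweet.get("user") or {}
    created_at = tweet.get("created_at")
    return {
        "screen_name": user.get("screen_name"),
        "user_id": user.get("id"),
        "location": user.get("location"),
        "created_at": (datetime.strptime(created_at, TWITTER_TIME_FORMAT)
                       if created_at else None),
        "text": tweet.get("text"),
        "reply_to_user_id": tweet.get("in_reply_to_user_id"),
    }


sample_tweet = {
    "created_at": "Mon Jan 27 15:59:59 +0000 2020",
    "id": 1221825187718385664,
    "text": "RT @elmnzhri: #Coronavirus spreading in our country\n\n"
            "The government: https://t.co/tR5kpqENgk",
    "in_reply_to_user_id": None,
    "user": {"id": 303054216,
             "screen_name": "SyuhadaWhoaa",
             "location": "Kuala Lumpur"},
}

row = extract_fields(sample_tweet)
print(row["screen_name"], row["location"], row["created_at"])
```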

### Data preprocessing and cleaning

In [23]:
def removeURL(text: str) -> str:
    '''
    Removes URLs (strings that start with 'http://' or 'https://') from text.
    
    Args:
    -----
    text: Input string that we want to remove the URLs from.
     
    Returns:
    -------
    text: The input string with any URLs removed.
    '''

    # Matches 'http' followed by URL characters, plus one more arbitrary
    # character (typically the space right after the URL)
    regex = r'http[0-9a-zA-Z\\/.:]+.'
    urllinks = re.findall(regex, text)
    for url in urllinks:
        print(f'String removed: {url}')
        text = text.replace(url, '')
    # Always return the text, even when it contains no URL
    return text
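
For comparison, URL removal can also be written as a single substitution. The sketch below is an alternative under stated assumptions, not the function used in this notebook: `strip_urls` is a hypothetical name, and the pattern treats any run of non-whitespace characters after `http://` or `https://` as part of the URL.

```python
import re

# 'http://' or 'https://' followed by everything up to the next whitespace
URL_PATTERN = re.compile(r"https?://\S+")


def strip_urls(text: str) -> str:
    """Remove URLs, collapse doubled spaces left behind, and trim the ends."""
    without_urls = URL_PATTERN.sub("", text)
    return re.sub(r"[ \t]{2,}", " ", without_urls).strip()


# Tweet text taken from the example output shown earlier
tweet_text = ("RT @elmnzhri: #Coronavirus spreading in our country\n\n"
              "The government: https://t.co/tR5kpqENgk")
cleaned = strip_urls(tweet_text)
print(cleaned)
```

Because the function always returns a string, tweets that contain no URL pass through unchanged.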

In [24]:
# Create a dataframe based on the relevant data from the full list of received tweets

user_ls, userid_ls, tweet_ls = [], [], []
location_ls, datetime_ls, replyto_ls = [], [], []

for tweet_dict in allTweetsList:
    user_ls.append(tweet_dict['user']['screen_name'])
    userid_ls.append(tweet_dict['user']['id'])
    tweet_ls.append(removeURL(tweet_dict['text']))
    replyto_ls.append(tweet_dict['in_reply_to_user_id'])
    location_ls.append(tweet_dict['user']['location'])
    datetime_ls.append(tweet_dict['created_at'])

String removed: https://t.co/Zw8a4VJ5ot
String removed: https://t.co/tR5kpqENgk
String removed: https://t.co/nhjkKxu7By
String removed: https://t.co/rnCuXBTxw5 
String removed: https://t.co/IZYJqlq3l0
String removed: https://t.co/hNtVBqK7dL
String removed: https://t.co/mEywxUwG1u
String removed: https://t.…
String removed: https://t.co/YzcRsaHfIl
String removed: https://t.co/izSGrGxf8P
String removed: https://t.co/cmCdSR1I6Y
String removed: https://t.co/PvQkMknhgM
String removed: https://t.co/RC1og3WkhW
String removed: https://t.co/ehwAZMouqC
String removed: https://t.co/ajiNZF9BRF
String removed: https://t.co/r6SR12isXI
String removed: https://t.co/98rpIqs7uh
String removed: https://t.co/sAMmKpBw6F
String removed: https://t.co/0y7oCNNHmL
String removed: https://t.co/c7FvrI4KAR
String removed: https://t.co/5gAbJpjHaO
String removed: https://t.co/n5Vc2qsU9O
String removed: https://t.co/xllZ5wbG15
String removed: https://t.co/uVruLZ2IHv 
String removed: https://t.co/lvYUQNfqgs

String removed: https://t.co/dAQjKPawBH 
String removed: https://t.co/lCjgQ2ZTVK
String removed: https://t.co/xQ8z2ihrAu
String removed: https…
String removed: https://t.co/xQ8z2ihrAu
String removed: https://t.co/xQ8z2ihrAu
String removed: https://t.co/xQ8z2ihrAu
String removed: https://t.co/QRE32EJDC6
String removed: https://t.co/xB7277sNXZ 
String removed: https://t.co/kLAwQovih2
String removed: https://t.co/WvgheTUSe2
String removed: https://t.co/xQ8z2ihrAu
String removed: https://t.co/MxHvNNd40C
String removed: https://t.co/xQ8z2ihrAu
String removed: https://t.co/LFHwxaS4NN
String removed: https://t.co/cYJuodjqtu 
String removed: https://t.co/4yygeKjp8o 
String removed: https://t.co/oNyK8ohvOp
String removed: https://t.co/nU3sIqTbvz
String removed: https://t.co/OdplI75hdo
String removed: https://t.co/RJ0KO10yva
String removed: https://t.co/MwT…
String removed: https://t.co/d6horvhygs
String removed: https://t.co/EpXqnVMr79
String removed: https://t.co/aUppWXABjF

String removed: https://t.co/M34…
String removed: https://t.co/9a2RQVvVAi
String removed: https://t.co/am8enLXNGp 
String removed: https://t.co/pXFQ6MNAc4 
String removed: https://t.co/CeayqDtK4I
String removed: https://t.co/fCVGWXDcNN 
String removed: https://t.co/6s2MOPC37h
String removed: https://t.co/MyR7ImRiTE
String removed: https://t.co/XVqsL5IM2g
String removed: https://t.co/3DHLedfWpp
String removed: https://t.co/9SNHwZVEjI
String removed: https://t.co/YsdAgnpTv9
String removed: https://t.co/vYUHA4zrZS
String removed: https://t.co/3mACZOGMyn
String removed: https://t.co/oUcRTBl4m6
String removed: https://t.co/XyqaV2plIn
String removed: https://t.co/4ssJICNU7P
String removed: https://t.co/9SPFNZVRId
String removed: https://t.co/c5up1cHzZZ
String removed: https://t.co/gm9k6iSL57
String removed: https://t.co/kqHSXss9ef 
String removed: https://t.co/…
String removed: https://t.co/bFWibgNdNq
String removed: https://t.co/RqA1nV31Ii
String removed: https://t.co/92tww3jw2E 