# Exploratory Data and Sentiment Analysis of COVID-19-related tweets through the Twitter API

### Author: George Spyrou
### Date: 01/03/2020

The purpose of this project is to use the Twitter API to analyse tweets related to the COVID-19 virus. The project started as an exploratory task to learn how the Twitter API can be used to retrieve tweets, and how to use the tweepy and searchtweets Python packages.

After I managed to retrieve the data, I became interested in analysing the retrieved tweets. COVID-19, commonly known as coronavirus, has been one of the most discussed topics on Twitter during the period 01/01/2020 - 01/03/2020. Using relevant tweets, we perform some exploratory data analysis (e.g. finding the most common words in tweets related to COVID-19, or identifying the most common bigrams) and then attempt to identify the sentiment of the tweets using a variety of methods.
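
As a rough illustration of the word and bigram counts mentioned above, here is a minimal sketch using only the standard library. The stop-word list, the tokenisation rule and the sample tweets are assumptions made for the example; the functions actually used in this project live in twitterCustomFunc.py and may work differently.

```python
import re
from collections import Counter

# Hypothetical stop-word list, kept tiny for illustration only
STOP_WORDS = {"the", "a", "is", "in", "of", "to", "and", "rt"}


def tokenize(text):
    """Lowercase a tweet, drop URLs, and return its non-stop-word tokens."""
    text = re.sub(r"https?://\S+", " ", text.lower())
    tokens = re.findall(r"[a-z0-9']+", text)
    return [t for t in tokens if t not in STOP_WORDS]


def most_common_words(tweets, n=3):
    """Count single words across all tweets."""
    counts = Counter()
    for tweet in tweets:
        counts.update(tokenize(tweet))
    return counts.most_common(n)


def most_common_bigrams(tweets, n=3):
    """Count pairs of adjacent words within each tweet."""
    counts = Counter()
    for tweet in tweets:
        tokens = tokenize(tweet)
        counts.update(zip(tokens, tokens[1:]))
    return counts.most_common(n)


# Made-up tweets, not data from the API
sample_tweets = [
    "Coronavirus cases rise in Italy https://example.com/a",
    "New coronavirus cases reported",
    "Stay safe, wash your hands",
]
print(most_common_words(sample_tweets, n=2))
print(most_common_bigrams(sample_tweets, n=1))
```

Bigrams are counted per tweet, so the last word of one tweet is never paired with the first word of the next.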

#### Version 1: The first version was completed on 01/03/2020 and includes analysis of:

- Most common words present in tweets.
- Most common bigrams (i.e. pairs of words that often appear next to each other).
- Sentiment analysis using the Liu Hu opinion lexicon algorithm.
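
To give an idea of how a lexicon-based sentiment method such as the one listed above works, the sketch below labels a text by comparing how many of its words appear in a positive list and in a negative list. The two word sets here are tiny stand-ins invented for the example, not the real Liu Hu lexicon; with NLTK installed, the full lists should be available through `nltk.corpus.opinion_lexicon` after downloading that corpus.

```python
import re

# Tiny stand-in word lists (assumption, for illustration only); the real
# Liu Hu opinion lexicon contains several thousand positive and negative words.
POSITIVE_WORDS = {"good", "safe", "recovered", "hope", "great"}
NEGATIVE_WORDS = {"bad", "death", "fear", "outbreak", "panic"}


def lexicon_sentiment(text):
    """Label a text by comparing counts of positive and negative lexicon words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    pos = sum(token in POSITIVE_WORDS for token in tokens)
    neg = sum(token in NEGATIVE_WORDS for token in tokens)
    if pos > neg:
        return "Positive"
    if neg > pos:
        return "Negative"
    return "Neutral"


# Made-up tweets, not data from the API
for tweet in ["Great news, many patients recovered",
              "Panic and fear as the outbreak spreads",
              "Coronavirus update for today"]:
    print(lexicon_sentiment(tweet), "-", tweet)
```

A text with no lexicon words, or with equal counts on both sides, is labelled neutral, which is one reason such a simple method can struggle with short or sarcastic tweets.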
    
In the first part of the project, we set up the environment required for our analysis.

In [1]:
# Import dependencies
import os
import json
import pandas as pd

# Plots and graphs
import matplotlib.pyplot as plt
import seaborn as sns

# Set up the project environment

# Secure location of the required keys to connect to the API
# This config also contains the search query (in this case 'coronavirus')
json_loc = '/Users/georgiosspyrou/Desktop/config_tweets/Twitter/twitter_config.json'

with open(json_loc) as json_file:
    data = json.load(json_file)

# Project folder location and keys
os.chdir(data["project_directory"])
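
The notebook reads four entries from the config file: `project_directory`, `keys`, `search_query` and `outputFiles`. The sketch below shows one possible way to load such a file and fail early if an entry is missing. The `load_config` helper and all example values are hypothetical and only illustrate the expected shape of the file.

```python
import json
import os
import tempfile

# Keys that the notebook reads from the config; the example values further
# down are placeholders, not the author's actual settings.
REQUIRED_KEYS = ("project_directory", "keys", "search_query", "outputFiles")


def load_config(path):
    """Load the JSON config and fail early if a required key is missing."""
    with open(path, encoding="utf-8") as config_file:
        config = json.load(config_file)
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"Config is missing keys: {missing}")
    return config


# Usage with a throwaway config file
example = {
    "project_directory": tempfile.gettempdir(),
    "keys": "twitter_keys.yaml",
    "search_query": "coronavirus",
    "outputFiles": "output",
}
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    json.dump(example, tmp)
config = load_config(tmp.name)
os.remove(tmp.name)
print(config["search_query"])
```

Keeping the API keys and the query in a file outside the repository, as done here, avoids committing credentials by accident.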

For this project we had to create a variety of functions: some are used to retrieve and clean the data, while others are used for the main analysis and plotting. For more information regarding these functions, please refer to the **twitterCustomFunc.py** file.

In [None]:
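# Import the packages and custom functions that we will use to retrieve and analyse the data
from datetime import datetime, timedelta

from searchtweets import load_credentials, gen_rule_payload, ResultStream

import twitterCustomFunc as twf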

twitter_keys_loc = data["keys"]

# Load the credentials to get access to the API
premium_search_args = load_credentials(twitter_keys_loc,
                                       yaml_key="search_tweets_api",
                                       env_overwrite=False)
print(premium_search_args)

# Set tweet extraction period and create a list of days of interest
fromDate = "2020-02-21"
toDate = "2020-02-25"

daysList = [fromDate]

while fromDate != toDate:
    date = datetime.strptime(fromDate, "%Y-%m-%d")
    mod_date = date + timedelta(days=1)
    incrementedDay = datetime.strftime(mod_date, "%Y-%m-%d")
    daysList.append(incrementedDay)
    
    fromDate = incrementedDay

# Retrieve the data for each day from the API
for day in daysList:
    
    dayNhourList = twf.createDateTimeFrame(day, hourSep=2)
    
    for hs in dayNhourList:
        fromDate = hs[0]
        toDate = hs[1]
        # Create the searching rule for the stream
        rule = gen_rule_payload(pt_rule=data['search_query'],
                                from_date=fromDate,
                                to_date=toDate,
                                results_per_call=100)

        # Set up the stream
        rs = ResultStream(rule_payload=rule,
                          max_results=100,
                          **premium_search_args)

        # Create a .jsonl with the results of the Stream query
        #file_date = datetime.now().strftime('%Y_%m_%d_%H_%M')
        file_date = '_'.join(hs).replace(' ', '').replace(':','')
        filename = os.path.join(data["outputFiles"],f'twitter_30day_results_{file_date}.jsonl')
    
        # Write the data received from the API to a file
        with open(filename, 'a', encoding='utf-8') as f:
            cntr = 0
            for tweet in rs.stream():
                cntr += 1
                # Report progress every 100 tweets
                if cntr % 100 == 0:
                    n_str, cr_date = str(cntr), tweet['created_at']
                    print(f'\n {n_str}: {cr_date}')
                # Write every tweet as one JSON object per line
                json.dump(tweet, f)
                f.write('\n')
        print(f'Created file {filename}')
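
The retrieval loop above depends on two pieces of date handling: the list of days and the split of each day into windows of `hourSep` hours. The sketch below shows one way both could be written. The real `twf.createDateTimeFrame` is defined in twitterCustomFunc.py and is not shown here, so `create_datetime_frame`, `days_between` and the `"%Y-%m-%d %H:%M"` window format are assumptions made for illustration.

```python
from datetime import datetime, timedelta

DAY_FORMAT = "%Y-%m-%d"
WINDOW_FORMAT = "%Y-%m-%d %H:%M"  # assumed format for the window boundaries


def days_between(from_date, to_date):
    """Return every day from from_date to to_date (inclusive) as strings."""
    start = datetime.strptime(from_date, DAY_FORMAT)
    end = datetime.strptime(to_date, DAY_FORMAT)
    return [(start + timedelta(days=offset)).strftime(DAY_FORMAT)
            for offset in range((end - start).days + 1)]


def create_datetime_frame(day, hour_sep=2):
    """Split one day into consecutive (start, end) windows of hour_sep hours."""
    start = datetime.strptime(day, DAY_FORMAT)
    windows = []
    for hour in range(0, 24, hour_sep):
        window_start = start + timedelta(hours=hour)
        # The last window is capped at midnight of the following day
        window_end = start + timedelta(hours=min(hour + hour_sep, 24))
        windows.append((window_start.strftime(WINDOW_FORMAT),
                        window_end.strftime(WINDOW_FORMAT)))
    return windows


print(days_between("2020-02-21", "2020-02-25"))
print(create_datetime_frame("2020-02-21", hour_sep=2)[:2])
```

Unlike a `while fromDate != toDate` loop, `days_between` returns an empty list instead of looping forever when the end date precedes the start date.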

In [None]:
# Read the data from the created jsonl files
jsonl_files_folder = os.path.join(data["project_directory"], data["outputFiles"])

In [None]:
# List that will contain all the Tweets that we managed to receive via the use of the API
allTweetsList = []

for file in os.listdir(jsonl_files_folder):
    if 'twitter' in file:
        tweets_full_list = twf.loadJsonlData(os.path.join(jsonl_files_folder, file))
        allTweetsList += tweets_full_list
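
Each .jsonl file holds one tweet per line as a JSON object, so loading it comes down to parsing the file line by line. The sketch below shows what a loader such as `twf.loadJsonlData` might do. The `load_jsonl` function is a hypothetical stand-in, since the actual implementation is in twitterCustomFunc.py, and the sample tweets are made up.

```python
import json
import os
import tempfile


def load_jsonl(path):
    """Read a .jsonl file and return one parsed object per non-empty line."""
    records = []
    with open(path, encoding="utf-8") as jsonl_file:
        for line in jsonl_file:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Round trip with two made-up tweets, written one JSON object per line
sample = [{"id": 1, "text": "first tweet"}, {"id": 2, "text": "second tweet"}]
path = os.path.join(tempfile.mkdtemp(), "twitter_sample.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for tweet in sample:
        json.dump(tweet, f)
        f.write("\n")

loaded = load_jsonl(path)
print(len(loaded), loaded[0]["text"])
```

Skipping empty lines makes the loader tolerant of trailing newlines, which are common in files that were opened in append mode.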