### Purpose of Script

In this script, I'll build on the previous hydrate scripts and hydrate tweets from 2020-12-22 to 2021-01-09 (the range covered by the files listed below). This window should line up with news about the new COVID strain from England as well as vaccine-related news. The tweet IDs will be sourced from https://ieee-dataport.org/open-access/coronavirus-covid-19-geo-tagged-tweets-dataset

In [16]:
import numpy as np
import pandas as pd
import sys
import os
import json
import datetime
import re
import nltk 
from nltk.corpus import stopwords
import emoji

pd.set_option('display.max_columns', None) # show all columns

#### 1. Hydrate tweets
In this part of the code, we'll read the daily .csv files from the website above and extract the tweet IDs from each one. Then we'll export the combined IDs (December 22nd to January 9th, matching the export file name below) as a single .csv file.

In [4]:
TWEET_ID_DIR = "../../data/tweets/tweet_ids/"

In [5]:
def get_tweets_to_hydrate(link):
    
    """
        Takes the link (or local path) to one of the daily csv files and returns its tweet IDs as a list
        
        Assumes the file has two columns and no header row: tweet ID and sentiment score
        
    """
    
    df = pd.read_csv(link, names=["tweet_id", "sentiment_score"])
    
    df.drop_duplicates(inplace=True)
    
    tweet_ids = list(df["tweet_id"])
    
    return tweet_ids
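
To see what this kind of parsing produces without downloading anything, here is a minimal standard-library sketch on an in-memory sample. The rows are made-up values in the same `tweet_id,sentiment_score` layout, and `parse_tweet_ids` is a hypothetical helper, not part of this project. One difference to keep in mind: it dedupes on the ID alone, whereas `drop_duplicates` above compares whole rows.

```python
import csv
import io

# Made-up sample in the two-column layout (tweet ID, sentiment score), no header row
SAMPLE_CSV = (
    "1341000000000000001,0.25\n"
    "1341000000000000002,-0.5\n"
    "1341000000000000001,0.25\n"
)

def parse_tweet_ids(csv_text):
    """Hypothetical helper: return the unique tweet IDs in first-seen order."""
    seen = set()
    tweet_ids = []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # skip blank lines
        tweet_id = int(row[0])  # Python ints hold 19-digit IDs exactly
        if tweet_id not in seen:
            seen.add(tweet_id)
            tweet_ids.append(tweet_id)
    return tweet_ids

sample_ids = parse_tweet_ids(SAMPLE_CSV)
print(sample_ids)
```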

In [6]:
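def save_tweet_IDs(tweet_ids, filepath):
    """
        Takes list of tweet IDs, exports as .csv (one ID per line, each followed by a comma except the last)
    """
    
    # "w" overwrites any existing file, so rerunning this cell doesn't append a second copy of the IDs
    with open(filepath, "w") as f:
        for idx, tweet in enumerate(tweet_ids):
            if idx != len(tweet_ids) - 1:
                f.write(f"{tweet}, \n")
            else:
                f.write(f"{tweet}")
                
    print("CSV file successfully exported")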
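
`save_tweet_IDs` puts a comma and a space after every ID except the last. A variant worth considering is to write one bare ID per line, which, as far as I know, is the plain layout that ID-list tools usually expect. The sketch below is only an illustration: `write_ids_one_per_line` is a hypothetical name and the IDs are made up.

```python
import os
import tempfile

def write_ids_one_per_line(tweet_ids, filepath):
    """Hypothetical variant: overwrite filepath with one tweet ID per line."""
    # "w" means a rerun replaces the file instead of appending to it
    with open(filepath, "w") as f:
        f.write("\n".join(str(tweet_id) for tweet_id in tweet_ids))

# Round-trip check on made-up IDs in a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_ids.txt")
    write_ids_one_per_line([101, 102, 103], demo_path)
    with open(demo_path) as f:
        lines_read = f.read().splitlines()

print(lines_read)
```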

In [8]:
links_list = ["https://ieee-dataport.s3.amazonaws.com/open/14206/december21_december22.csv?response-content-disposition=attachment%3B%20filename%3D%22december21_december22.csv%22&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJOHYI4KJCE6Q7MIQ%2F20210111%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20210111T151953Z&X-Amz-SignedHeaders=Host&X-Amz-Expires=86400&X-Amz-Signature=607f3b90fef5da9626ddd82e5c2361c57f6f8e5cc9ecbcb2a23f46dac5caeccf", 
              "https://ieee-dataport.s3.amazonaws.com/open/14206/december22_december23.csv?response-content-disposition=attachment%3B%20filename%3D%22december22_december23.csv%22&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJOHYI4KJCE6Q7MIQ%2F20210111%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20210111T151953Z&X-Amz-SignedHeaders=Host&X-Amz-Expires=86400&X-Amz-Signature=ec8acbb3ccd2dd3e099632b6bd5dec39d35feca56a9caac975499236bd3a6c87", 
              "https://ieee-dataport.s3.amazonaws.com/open/14206/december23_december24.csv?response-content-disposition=attachment%3B%20filename%3D%22december23_december24.csv%22&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJOHYI4KJCE6Q7MIQ%2F20210111%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20210111T151953Z&X-Amz-SignedHeaders=Host&X-Amz-Expires=86400&X-Amz-Signature=b974795c225f98049c82e637990468413f66c6ca80e4833029e68bd7e91c80e2", 
              "https://ieee-dataport.s3.amazonaws.com/open/14206/december24_december25.csv?response-content-disposition=attachment%3B%20filename%3D%22december24_december25.csv%22&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJOHYI4KJCE6Q7MIQ%2F20210111%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20210111T151953Z&X-Amz-SignedHeaders=Host&X-Amz-Expires=86400&X-Amz-Signature=3c0978afccb0388ff13fe1ff61052c904251ba4447465a580f109b21d714d698", 
              "https://ieee-dataport.s3.amazonaws.com/open/14206/december25_december26.csv?response-content-disposition=attachment%3B%20filename%3D%22december25_december26.csv%22&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJOHYI4KJCE6Q7MIQ%2F20210111%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20210111T151953Z&X-Amz-SignedHeaders=Host&X-Amz-Expires=86400&X-Amz-Signature=3b760e6cc10f7fc8d66bbdfc37bad9f677fa001c099fafe700a56b1e36e56a26", 
              "https://ieee-dataport.s3.amazonaws.com/open/14206/december26_december27.csv?response-content-disposition=attachment%3B%20filename%3D%22december26_december27.csv%22&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJOHYI4KJCE6Q7MIQ%2F20210111%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20210111T151953Z&X-Amz-SignedHeaders=Host&X-Amz-Expires=86400&X-Amz-Signature=14409a2f99696c8f1a8ab90fecf5a0fefe587d43aabb5aef0d197aefc42ac137", 
              "https://ieee-dataport.s3.amazonaws.com/open/14206/december27_december28.csv?response-content-disposition=attachment%3B%20filename%3D%22december27_december28.csv%22&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJOHYI4KJCE6Q7MIQ%2F20210111%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20210111T151953Z&X-Amz-SignedHeaders=Host&X-Amz-Expires=86400&X-Amz-Signature=3cc145bddf4eb7d8eac5273901c1f7af73018ca1dbbc7b7fb4ee14102f24eda7", 
              "https://ieee-dataport.s3.amazonaws.com/open/14206/december28_december29.csv?response-content-disposition=attachment%3B%20filename%3D%22december28_december29.csv%22&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJOHYI4KJCE6Q7MIQ%2F20210111%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20210111T151953Z&X-Amz-SignedHeaders=Host&X-Amz-Expires=86400&X-Amz-Signature=1e86cf986d021a3cbbd360f15fa74f0ad57431fc8ce39a4fbd4697e415fbed9a", 
              "https://ieee-dataport.s3.amazonaws.com/open/14206/december29_december30.csv?response-content-disposition=attachment%3B%20filename%3D%22december29_december30.csv%22&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJOHYI4KJCE6Q7MIQ%2F20210111%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20210111T151953Z&X-Amz-SignedHeaders=Host&X-Amz-Expires=86400&X-Amz-Signature=4d66f267117abae9c56e9c96cce22d42376b8a941bd8b6816203ec263cf3eabb", 
              "https://ieee-dataport.s3.amazonaws.com/open/14206/december30_december31.csv?response-content-disposition=attachment%3B%20filename%3D%22december30_december31.csv%22&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJOHYI4KJCE6Q7MIQ%2F20210111%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20210111T151953Z&X-Amz-SignedHeaders=Host&X-Amz-Expires=86400&X-Amz-Signature=23843c165f6cd122a22f705119bb178927f9847949193a6ee822d29f48a73497", 
              "https://ieee-dataport.s3.amazonaws.com/open/14206/december31_january1.csv?response-content-disposition=attachment%3B%20filename%3D%22december31_january1.csv%22&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJOHYI4KJCE6Q7MIQ%2F20210111%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20210111T151953Z&X-Amz-SignedHeaders=Host&X-Amz-Expires=86400&X-Amz-Signature=48361c629ee064426bc6f4c199318d2d9f831082713d12ffbde9d71ef1f1f283", 
              "https://ieee-dataport.s3.amazonaws.com/open/14206/january1_january2.csv?response-content-disposition=attachment%3B%20filename%3D%22january1_january2.csv%22&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJOHYI4KJCE6Q7MIQ%2F20210111%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20210111T151953Z&X-Amz-SignedHeaders=Host&X-Amz-Expires=86400&X-Amz-Signature=cf8cff84bf89a487aa483f731fe39b08c835f64840aaaa0802a0d786adc3b220", 
              "https://ieee-dataport.s3.amazonaws.com/open/14206/january2_january3.csv?response-content-disposition=attachment%3B%20filename%3D%22january2_january3.csv%22&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJOHYI4KJCE6Q7MIQ%2F20210111%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20210111T151953Z&X-Amz-SignedHeaders=Host&X-Amz-Expires=86400&X-Amz-Signature=0e9a099b0fb3bccc316326e4b1fbf0b1ce43aee048a3a626e6772b6bf4b162ad", 
              "https://ieee-dataport.s3.amazonaws.com/open/14206/january3_january4.csv?response-content-disposition=attachment%3B%20filename%3D%22january3_january4.csv%22&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJOHYI4KJCE6Q7MIQ%2F20210111%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20210111T151953Z&X-Amz-SignedHeaders=Host&X-Amz-Expires=86400&X-Amz-Signature=387d4c2111e776d9811255bc9144411ddaf79680c7cc0a18611198b7112bb645", 
              "https://ieee-dataport.s3.amazonaws.com/open/14206/january4_january5.csv?response-content-disposition=attachment%3B%20filename%3D%22january4_january5.csv%22&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJOHYI4KJCE6Q7MIQ%2F20210111%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20210111T151953Z&X-Amz-SignedHeaders=Host&X-Amz-Expires=86400&X-Amz-Signature=4052f80aa36eac3513134ae5e669adfcb984d98747ddaa0f11234d65d51f4fff", 
              "https://ieee-dataport.s3.amazonaws.com/open/14206/january5_january6.csv?response-content-disposition=attachment%3B%20filename%3D%22january5_january6.csv%22&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJOHYI4KJCE6Q7MIQ%2F20210111%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20210111T151953Z&X-Amz-SignedHeaders=Host&X-Amz-Expires=86400&X-Amz-Signature=87cc104e237a442db620be49a05a0031d26da3ef32b8c7fc67a57a40619635c9", 
              "https://ieee-dataport.s3.amazonaws.com/open/14206/january6_january7.csv?response-content-disposition=attachment%3B%20filename%3D%22january6_january7.csv%22&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJOHYI4KJCE6Q7MIQ%2F20210111%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20210111T151953Z&X-Amz-SignedHeaders=Host&X-Amz-Expires=86400&X-Amz-Signature=b124da74209b9a132248debc51a7811a21518509bd66169730007f1029facb99", 
              "https://ieee-dataport.s3.amazonaws.com/open/14206/january7_january8.csv?response-content-disposition=attachment%3B%20filename%3D%22january7_january8.csv%22&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJOHYI4KJCE6Q7MIQ%2F20210111%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20210111T151953Z&X-Amz-SignedHeaders=Host&X-Amz-Expires=86400&X-Amz-Signature=df111623cc460870fcf5642bbbc2d5acad64cd7a75d9547dd02074ded3bae9fd", 
              "https://ieee-dataport.s3.amazonaws.com/open/14206/january8_january9.csv?response-content-disposition=attachment%3B%20filename%3D%22january8_january9.csv%22&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJOHYI4KJCE6Q7MIQ%2F20210111%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20210111T151953Z&X-Amz-SignedHeaders=Host&X-Amz-Expires=86400&X-Amz-Signature=a38b30deac63e489a173255d830bb5073367fef6dfd2e6d2dcbfc9a6d0d4c76f"]

In [9]:
IDs_list = []

In [10]:
for link in links_list:
    IDs_list.extend(get_tweets_to_hydrate(link))
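
Duplicates are dropped within each file, but nothing here checks whether the same ID shows up in two different daily files. I haven't verified whether the files actually overlap. If they do, a minimal way to dedupe the combined list while keeping its order could look like this (the IDs are made up):

```python
# Made-up IDs standing in for two daily files that share one tweet
ids_day_one = [11, 12, 13]
ids_day_two = [13, 14]

combined_ids = ids_day_one + ids_day_two

# dict keys are unique and keep insertion order (Python 3.7+)
unique_ids = list(dict.fromkeys(combined_ids))
print(unique_ids)
```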

In [11]:
save_tweet_IDs(IDs_list, TWEET_ID_DIR + "tweet_IDs_2020-12-22_2021-01-09.csv")

CSV file successfully exported


Now, let's hydrate these tweet IDs to recover the original tweets.

First, you have to configure your Twitter API credentials for twarc:

`twarc configure`

Then, enter your credentials when prompted. If that succeeds, you should get a message like this:

`The credentials for default have been saved to your configuration file at /Users/mark/.twarc`

Afterwards, you can start hydrating the tweets from the command line. You'd run something like this:

`twarc hydrate ids.txt > tweets.jsonl`

In my case, running the command from the root directory of this project, it looks something like this:

`twarc hydrate data/tweets/tweet_ids/tweet_IDs_2020-12-22_2021-01-09.csv > data/tweets/hydrated_tweets/2020-12-22_2021-01-09_tweets.jsonl`
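
The resulting .jsonl file holds one JSON object (one tweet) per line. Keep in mind that hydration usually returns fewer tweets than IDs, since deleted or protected tweets can't be recovered. Here's a small sketch of the file layout and how to read it back with the standard library. The two records are made up and only contain a few of the fields used later in this script.

```python
import json
import os
import tempfile

# Two made-up records with a handful of the fields used later on
sample_tweets = [
    {"id": 1, "full_text": "first tweet", "place": None},
    {"id": 2, "full_text": "second tweet",
     "place": {"country_code": "US", "full_name": "Los Angeles, CA"}},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_tweets.jsonl")

    # Write one JSON object per line
    with open(demo_path, "w") as f:
        for tweet in sample_tweets:
            f.write(json.dumps(tweet) + "\n")

    # Read it back line by line, skipping blank lines
    with open(demo_path) as f:
        loaded_tweets = [json.loads(line) for line in f if line.strip()]

print(len(loaded_tweets), loaded_tweets[1]["place"]["full_name"])
```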



#### 2. Preprocess tweets

In [13]:
def get_state_from_location(place):
    """
    Gets state info from place field
    Assumes dict input
    """
    
    if place is None:
        state = "NA"  
    elif place["country_code"] != "US":
        state = "NA"
    else:
        state = place["full_name"].split(",")[1].strip() # e.g., "Los Angeles, CA" --> "CA"
        
    return state
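
`get_state_from_location` assumes that `full_name` always contains a comma. Below is a sketch of a more defensive variant that returns "NA" instead of raising when the value is missing, not a dict, or has no comma. `get_state_safe` is a hypothetical name and the sample places are made up.

```python
def get_state_safe(place):
    """Hypothetical variant of get_state_from_location that never raises on odd input."""
    if not isinstance(place, dict) or place.get("country_code") != "US":
        return "NA"
    parts = [part.strip() for part in place.get("full_name", "").split(",")]
    if len(parts) < 2:
        return "NA"  # no comma, so there is no second field to read
    return parts[1]

# Made-up place values covering the branches above
examples = [
    None,
    {"country_code": "GB", "full_name": "London, England"},
    {"country_code": "US", "full_name": "Los Angeles, CA"},
    {"country_code": "US", "full_name": "United States"},
]
results = [get_state_safe(place) for place in examples]
print(results)
```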

In [17]:
PUNCTUATION = r'''!()-[]{};:'"\,<>./?@$%^&*_~''' # keep hashtags; raw string so "\," isn't read as an (invalid) escape sequence
STOPWORDS = stopwords.words("english")

In [24]:
def remove_emoji(string):
    """
        Removes every whitespace-separated token that contains an emoji
    """
    # emoji >= 2.0 exposes EMOJI_DATA; older releases used UNICODE_EMOJI
    emoji_table = getattr(emoji, "EMOJI_DATA", None) or emoji.UNICODE_EMOJI
    emoji_chars = {char for char in string if char in emoji_table}
    text_no_emoji = ' '.join(token for token in string.split() if not any(char in token for char in emoji_chars))
    return text_no_emoji

In [19]:
def clean_text(text):
    """
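        Removes punctuation (except #), emojis, and non-ASCII characters, then lowercases
        and splits the text, dropping links and stopwords. Returns a list of tokens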
    """
    
    return_arr = []
    
    # remove punctuation
    text_no_punctuation = ""
    
    for char in text:
        if char not in PUNCTUATION:
            text_no_punctuation = text_no_punctuation + char
            
    # remove emojis
    text_no_punctuation = remove_emoji(text_no_punctuation)
    text_no_punctuation = re.sub(r'\\U[a-zA-Z0-9]{8}', '', text_no_punctuation)
    
    # remove \n and \t
    text_no_punctuation = re.sub(r'\n', '', text_no_punctuation)
    text_no_punctuation = re.sub(r'\t', '', text_no_punctuation)
    
    # remove non-ASCII characters (e.g., any emojis that are left)
    text_no_escape = ""
    
    for char in text_no_punctuation:
        try:
            char.encode('ascii')
            text_no_escape = text_no_escape + char
        except UnicodeEncodeError: # char has no ascii equivalent, so skip it
            pass
    
    # add space between # and another char before it (e.g., split yes#baseball into yes #baseball)
    text_no_escape = re.sub(r"([a-zA-Z0-9]){1}#", r"\1 #", text_no_escape)
    
    # other preprocessing
    text_arr = text_no_escape.split(' ')
    
    for word in text_arr:
        
        # clean words
        word = word.lower()
        
        if "http" not in word and word.strip() != '' and word not in STOPWORDS:
            return_arr.append(word)
            
    return return_arr
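
To make the hashtag-spacing step in `clean_text` concrete, here's the same regular expression applied to a few made-up strings. The name `HASHTAG_GLUE` is just for this illustration.

```python
import re

# Same pattern as in clean_text: a letter or digit directly followed by '#'
HASHTAG_GLUE = re.compile(r"([a-zA-Z0-9]){1}#")

samples = ["yes#baseball", "go #team", "#covid19#vaccine"]

# Insert a space between the captured character and the '#'
spaced = [HASHTAG_GLUE.sub(r"\1 #", sample) for sample in samples]
print(spaced)
```

A `#` at the start of the string or after a space is left alone, so hashtags that are already separated don't change.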

In [20]:
def clean_hydrated_tweets(tweet_jsonl_path):
    
    """
    
        Takes .jsonl from Twitter, returns the cleaned df
        
    """
    
    # get uncleaned df from json
    df = pd.read_json(path_or_buf=tweet_jsonl_path, lines=True)
    df = df[["created_at", "id", "full_text", "geo", "coordinates", "place", "retweet_count", "favorite_count"]]
    
    # get state
    states = []
    
    for location_dict in df["place"]:
        try:
            states.append(get_state_from_location(location_dict))
        except Exception as e:
            print(location_dict)
            print(e)
            states.append("NA") # keep one entry per row so the column assignment below doesn't fail
            
    df["US_state"] = states
    
    # get dates of tweets
    dates = []
    months = []
    days = []
    hours = []

    for timestamp in df["created_at"]:
        ts = pd.to_datetime(timestamp)

        hours.append(ts.hour)
        months.append(ts.month)
        days.append(ts.day)

        # isoformat() zero-pads both month and day (YYYY-MM-DD)
        dates.append(ts.date().isoformat())

    df["date_of_tweet"] = dates
    df["month_of_tweet"] = months
    df["day_of_tweet"] = days
    df["hour_of_tweet"] = hours

    # clean the text
    df["cleaned_text"] = df["full_text"].apply(clean_text)

    # work with hashtags
    hashtags_arr = []
    num_hashtags_arr = []
    text_no_hashtags_arr = []

    for tokenized_text in df["cleaned_text"]:
        hashtag_lst = []
        text_no_hashtags_lst = []

        for word in tokenized_text:
            if '#' in word:
                hashtag_lst.append(word)
            else:
                text_no_hashtags_lst.append(word)

        hashtags_arr.append(hashtag_lst)
        num_hashtags_arr.append(len(hashtag_lst))
        text_no_hashtags_arr.append(text_no_hashtags_lst)

    df["hashtags"] = hashtags_arr
    df["hashtags_count"] = num_hashtags_arr
    df["cleaned_text_no_hashtags"] = text_no_hashtags_arr

    # get only cols that we care about
    df_small = df[["id", "full_text", "retweet_count", "favorite_count", "place", 
                   "US_state", "date_of_tweet", "month_of_tweet", "day_of_tweet", 
                   "hour_of_tweet", "cleaned_text", "hashtags", "hashtags_count", "cleaned_text_no_hashtags"]]

    return df_small
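
As a quick illustration of the date columns, here's how a single timestamp breaks down using only the standard library. The timestamp is made up and written in the `created_at` layout of the Twitter v1.1 API. Note that the offset is +0000, so the hour is in UTC rather than the tweet author's local time.

```python
from datetime import datetime

# Made-up timestamp in the v1.1 created_at layout
raw_created_at = "Tue Dec 22 15:19:53 +0000 2020"

parsed = datetime.strptime(raw_created_at, "%a %b %d %H:%M:%S %z %Y")

date_features = {
    "date_of_tweet": parsed.date().isoformat(),  # zero-padded YYYY-MM-DD
    "month_of_tweet": parsed.month,
    "day_of_tweet": parsed.day,
    "hour_of_tweet": parsed.hour,                # UTC hour
}
print(date_features)
```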

Now let's run the cleaning steps on the hydrated tweets.

In [21]:
HYDRATED_TWEETS_DIR = "../../data/tweets/hydrated_tweets/"

In [25]:
tweets_df = clean_hydrated_tweets(HYDRATED_TWEETS_DIR + "2020-12-22_2021-01-09_tweets.jsonl")

#### 3. Export tweets

In [22]:
EXPORT_DIR = "../../data/tweets/"

In [29]:
tweets_df.to_csv(EXPORT_DIR + "tweets_2020-12-22_2021-01-09_with_locations.csv")
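
One thing to watch when loading this file later: list-valued columns such as `cleaned_text` and `hashtags` are stored in the CSV as their string form, so they come back as plain strings. The sketch below shows the effect and one way to undo it, using the standard `csv` module on a made-up row rather than the real export.

```python
import ast
import csv
import io

# Made-up row with a list-valued column, similar in shape to "hashtags"
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["id", "hashtags"])
writer.writerow([1, ["#covid19", "#vaccine"]])  # the list is written as its string form

buffer.seek(0)
rows = list(csv.DictReader(buffer))

raw_value = rows[0]["hashtags"]         # a plain string after the round trip
restored = ast.literal_eval(raw_value)  # parsed back into a Python list
print(type(raw_value).__name__, restored)
```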