# Data Collection

In this notebook, we use the GetOldTweets3 (GOT3) library to pull tweets from certain locations within a given timeline for comparative COVID-19 sentiment analysis. 

In choosing our timeline, we focused on the period when the highest number of states were implementing the strictest versions of their lockdown policies. According to the timeline source linked in the readme, this began on April 8. We ended our timeline on April 30, since May 1 was when many states began to relax (or plan to relax) their measures.

Our target locations were determined in a couple of ways:
1. We picked states based on data compiled by WalletHub (www.wallethub.com) that ranked each U.S. state (and D.C.) by how "aggressive" its measures were in response to COVID-19. (Conveniently, this data was compiled on April 7, the day before our timeline begins.) Using WalletHub's ranking system (their methodology is described in depth at the link in the readme), we chose the top 10 and bottom 10 states ranked by COVID response severity.

2. Because searching tweets by location with GOT3 is most effectively done by city, we then chose the largest city in each of our 20 states to serve as a proxy for the states' lockdown policies, or lack thereof. Our thinking here is that urban areas will offer a high concentration of tweets to collect, and larger cities' policies are most likely to resemble guidelines set by the state.
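
Because each city stands in for its state, any later join against state-level data needs a city-to-state lookup. The sketch below is one way to write it down; the names `TOP_10`, `BOTTOM_10`, `CITY_TO_STATE`, and `state_for` are illustrative and do not appear elsewhere in this notebook.

```python
# Hypothetical lookup (not part of the original notebook): maps each proxy
# city to the state it represents, split by WalletHub ranking group.
TOP_10 = {
    'New York City': 'New York',
    'Washington, D.C.': 'District of Columbia',
    'Anchorage': 'Alaska',
    'Honolulu': 'Hawaii',
    'Newark': 'New Jersey',
    'Providence': 'Rhode Island',
    'Seattle': 'Washington',
    'Boston': 'Massachusetts',
    'Manchester': 'New Hampshire',
    'Charleston': 'West Virginia',
}
BOTTOM_10 = {
    'Little Rock': 'Arkansas',
    'Jacksonville': 'Florida',
    'Jackson': 'Mississippi',
    'Houston': 'Texas',
    'Salt Lake City': 'Utah',
    'Cheyenne': 'Wyoming',
    'Birmingham': 'Alabama',
    'Omaha': 'Nebraska',
    'Sioux Falls': 'South Dakota',
    'Oklahoma City': 'Oklahoma',
}
CITY_TO_STATE = {**TOP_10, **BOTTOM_10}


def state_for(city):
    """Return (state, top_10 flag) for one of the 20 proxy cities."""
    return CITY_TO_STATE[city], int(city in TOP_10)
```

For example, `state_for('Newark')` returns `('New Jersey', 1)`.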

Though our goal is to compare sentiments in tweets posted from different states, we also wanted to see how these sentiments compared to non-pandemic times. For the closest comparison, we pulled tweets from the same date range (April 8-30) in 2019. We also wanted to see how people under different lockdown policies were tweeting about the pandemic specifically. To this end, we pulled three sets of data from Twitter:
1. Tweets by location/date 2020
2. Tweets by location/date 2019
3. Tweets by location/date 2020 w/ COVID-19 related keywords
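
All three pulls walk the same April 8-30 range one day at a time, differing only in year and in how many days are skipped. A minimal sketch of a shared helper follows; the function name `april_windows` and its parameters are hypothetical, and it assumes the end date of each search window is treated as exclusive.

```python
from datetime import date, timedelta


def april_windows(year, first_day=8, last_day=30, step=1):
    """Return (since, until) ISO date strings, one pair per sampled day.

    `until` is the day after `since`, so each pair covers a single day
    (assuming the search treats the end date as exclusive).
    """
    windows = []
    for day in range(first_day, last_day + 1, step):
        since = date(year, 4, day)
        windows.append((str(since), str(since + timedelta(days=1))))
    return windows


windows_2020 = april_windows(2020)           # every day, April 8-30
windows_2019 = april_windows(2019)           # same range one year earlier
sampled_2020 = april_windows(2020, step=3)   # every third day
```

Returning a list of pairs rather than a dict keeps the order explicit and works the same way for any year.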

The last part of this notebook pulls in COVID-19 infection and death count data from healthdata.org and cdc.gov. The infection data is by day, whereas the death data is by week; both are separated by state.

### Library Imports

In [1]:
import numpy as np
import pandas as pd
import time
from datetime import date, timedelta
import GetOldTweets3 as got

### Test Group: Tweets 2020

First we pulled tweets from our chosen cities for our April 2020 dates. These serve as the main corpus from which we worked moving forward. Here it is separated into two cells because the data took several hours to pull, and we didn't want to lose anything in case something threw an error or a connection went out.
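
Splitting the pull into two cells limits how much is lost on failure; a finer-grained alternative is to save each city's tweets as soon as they arrive. The sketch below illustrates that idea only and is not what the cells in this notebook do: `pull_with_checkpoints` and the `fetch` callable are hypothetical stand-ins for the tweet-pulling loop.

```python
import os
import tempfile

import pandas as pd


def pull_with_checkpoints(cities, fetch, out_dir):
    """Fetch one dataframe per city, saving each to disk as soon as it is ready.

    `fetch` is a hypothetical callable (city -> DataFrame) standing in for
    the tweet-pulling loop; cities whose csv already exists are skipped.
    """
    os.makedirs(out_dir, exist_ok=True)
    frames = []
    for city in cities:
        path = os.path.join(out_dir, f"{city.replace(' ', '_')}.csv")
        if os.path.exists(path):
            frames.append(pd.read_csv(path))  # reuse the earlier pull
            continue
        city_df = fetch(city)
        city_df.to_csv(path, index=False)     # checkpoint before moving on
        frames.append(city_df)
    return pd.concat(frames, ignore_index=True)


# Stand-in fetcher so the sketch runs without network access
def fake_fetch(city):
    return pd.DataFrame({'text': [f'hello from {city}'], 'city': [city]})


checkpoint_dir = tempfile.mkdtemp()
combined = pull_with_checkpoints(['Boston', 'Seattle'], fake_fetch, checkpoint_dir)
```

Rerunning the function after an interruption would then only fetch the cities that have no csv yet.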

In [8]:
# 10 cities representing 10 states with most severe lockdown policies
cities = ['New York City', 'Washington, D.C.', 'Anchorage',
       'Honolulu', 'Newark', 'Providence', 'Seattle', 'Boston',
       'Manchester', 'Charleston']

# Creates lambda function that returns the day after April x
add_one = lambda x: date(2020,4,x) + timedelta(days=1)

# Sets up dates in agreed upon timeframe to use in main code below
dates = {str(date(2020,4,i)) : str(add_one(i)) for i in range(8,31)}

# Creates empty list to populate with each city's tweet dataframe
tweet_df_list = []

# Sets counter to use with timer below
count = 0

for city in cities:
    for since, until in dates.items():
        
        # Due to these cities' proximity, a smaller radius was used
        # Due to these cities' density, this should have negligible effect on data size
        if city == 'New York City' or city == 'Newark':
            # Sets location, date, language, and size criteria for collected tweets
            tweetCriteria = got.manager.TweetCriteria()\
                            .setNear(city).setWithin('5mi').setLang('en')\
                            .setSince(since).setUntil(until)\
                            .setMaxTweets(1000)
        else:
            # Sets location, date, language, and size criteria for collected tweets
            tweetCriteria = got.manager.TweetCriteria()\
                            .setNear(city).setWithin('50mi').setLang('en')\
                            .setSince(since).setUntil(until)\
                            .setMaxTweets(1000)
    
        # Uses criteria above with 'getTweets' method and saves list as 'tweets'
        tweets = got.manager.TweetManager.getTweets(tweetCriteria)
    
        # Creates list of lists with tweet text and date
        tweet_info = [[tweet.text, tweet.date] for tweet in tweets]
    
        # Converts list of lists into dataframe
        tweet_df = pd.DataFrame(tweet_info, columns=['text', 'date'])
    
        # Creates column identifying tweet collection by city name
        tweet_df['city'] = city
    
        # Appends new df to df list
        tweet_df_list.append(tweet_df)
        
        # Prints progress message
        print(f'Pulled tweets from {city} for {since}')
        
        # Counts number of requests
        count += 1
    
        # Sets 3-min timer for every 3 loops
        if count % 3 == 0:
            time.sleep(180) 

# Concatenates dfs into one large df
df = pd.concat(tweet_df_list, ignore_index=True)

# Removes rows with no text
df = df[df['text'] != ""]

# Exports large df to csv
df.to_csv('../data/tweets/raw_top10_cities_2020.csv', index=False)

# Sneak peek
df

Pulled tweets from New York City for 2020-04-08
Pulled tweets from New York City for 2020-04-09
Pulled tweets from New York City for 2020-04-10
Pulled tweets from New York City for 2020-04-11
Pulled tweets from New York City for 2020-04-12
Pulled tweets from New York City for 2020-04-13
Pulled tweets from New York City for 2020-04-14
Pulled tweets from New York City for 2020-04-15
Pulled tweets from New York City for 2020-04-16
Pulled tweets from New York City for 2020-04-17
Pulled tweets from New York City for 2020-04-18
Pulled tweets from New York City for 2020-04-19
Pulled tweets from New York City for 2020-04-20
Pulled tweets from New York City for 2020-04-21
Pulled tweets from New York City for 2020-04-22
Pulled tweets from New York City for 2020-04-23
Pulled tweets from New York City for 2020-04-24
Pulled tweets from New York City for 2020-04-25
Pulled tweets from New York City for 2020-04-26
Pulled tweets from New York City for 2020-04-27
Pulled tweets from New York City for 202

Pulled tweets from Manchester for 2020-04-09
Pulled tweets from Manchester for 2020-04-10
Pulled tweets from Manchester for 2020-04-11
Pulled tweets from Manchester for 2020-04-12
Pulled tweets from Manchester for 2020-04-13
Pulled tweets from Manchester for 2020-04-14
Pulled tweets from Manchester for 2020-04-15
Pulled tweets from Manchester for 2020-04-16
Pulled tweets from Manchester for 2020-04-17
Pulled tweets from Manchester for 2020-04-18
Pulled tweets from Manchester for 2020-04-19
Pulled tweets from Manchester for 2020-04-20
Pulled tweets from Manchester for 2020-04-21
Pulled tweets from Manchester for 2020-04-22
Pulled tweets from Manchester for 2020-04-23
Pulled tweets from Manchester for 2020-04-24
Pulled tweets from Manchester for 2020-04-25
Pulled tweets from Manchester for 2020-04-26
Pulled tweets from Manchester for 2020-04-27
Pulled tweets from Manchester for 2020-04-28
Pulled tweets from Manchester for 2020-04-29
Pulled tweets from Manchester for 2020-04-30
Pulled twe

Unnamed: 0,text,date,city
0,had a mini photoshoot in my room,2020-04-08 23:59:59+00:00,New York City
1,I’m ready for someone to confess their love fo...,2020-04-08 23:59:57+00:00,New York City
2,These were some of the worst fits and transiti...,2020-04-08 23:59:54+00:00,New York City
3,"In my opinion ALL hospital workers, from secur...",2020-04-08 23:59:48+00:00,New York City
4,Thank you all for joining Our #FirstCupOfCoffe...,2020-04-08 23:59:44+00:00,New York City
...,...,...,...
213001,“Maybe that time when they were on the beach t...,2020-04-30 00:16:23+00:00,Charleston
213002,"Bestie, Meredith, and I: @SoleTravelMama dunna...",2020-04-30 00:13:24+00:00,Charleston
213003,I legit spent a good 30 seconds trying to thin...,2020-04-30 00:12:15+00:00,Charleston
213004,OH. MY. GOD.,2020-04-30 00:10:58+00:00,Charleston
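
With `setMaxTweets(1000)` per city and day, ten cities over 23 days can yield at most 230,000 rows, so counting tweets per city and day is one quick way to check a pull. The sketch below runs on a toy frame with the same columns as the output above; the helper name `tweets_per_city_day` is illustrative.

```python
import pandas as pd


def tweets_per_city_day(df):
    """Count tweets per (city, calendar day) from 'city' and 'date' columns."""
    days = pd.to_datetime(df['date'], utc=True).dt.date
    return df.groupby(['city', days]).size().rename('n_tweets').reset_index()


# Toy data shaped like the exported csv (text, date, city)
toy = pd.DataFrame({
    'text': ['a', 'b', 'c'],
    'date': ['2020-04-08 23:59:59+00:00',
             '2020-04-08 23:59:57+00:00',
             '2020-04-09 00:10:00+00:00'],
    'city': ['New York City', 'New York City', 'Charleston'],
})
counts = tweets_per_city_day(toy)
```

Any city and day with far fewer rows than the cap would be worth a second look before analysis.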


In [52]:
# 10 cities representing 10 states with least severe lockdown policies
cities = ['Little Rock', 'Jacksonville', 'Jackson', 'Houston',
       'Salt Lake City', 'Cheyenne', 'Birmingham', 'Omaha', 'Sioux Falls',
       'Oklahoma City']

# Creates lambda function that returns the day after April x
add_one = lambda x: date(2020,4,x) + timedelta(days=1)

# Sets up dates in agreed upon timeframe to use in main code below
dates = {str(date(2020,4,i)) : str(add_one(i)) for i in range(8,31)}

# Creates empty list to populate with each city's tweet dataframe
tweet_df_list = []

# Sets counter to use with timer below
count = 0

for city in cities:
    for since, until in dates.items():
       
        # Sets location, date, language, and size criteria for collected tweets
        tweetCriteria = got.manager.TweetCriteria()\
                        .setNear(city).setWithin('50mi').setLang('en')\
                        .setSince(since).setUntil(until)\
                        .setMaxTweets(1000)
    
        # Uses criteria above with 'getTweets' method and saves list as 'tweets'
        tweets = got.manager.TweetManager.getTweets(tweetCriteria)
    
        # Creates list of lists with tweet text and date
        tweet_info = [[tweet.text, tweet.date] for tweet in tweets]
    
        # Converts list of lists into dataframe
        tweet_df = pd.DataFrame(tweet_info, columns=['text', 'date'])
    
        # Creates column identifying tweet collection by city name
        tweet_df['city'] = city
    
        # Appends new df to df list
        tweet_df_list.append(tweet_df)
        
        # Prints progress message
        print(f'Pulled tweets from {city} for {since}')
        
        # Counts number of requests
        count += 1
    
        # Sets 3-min timer for every 3 loops
        if count % 3 == 0:
            time.sleep(180) 

# Concatenates dfs into one large df
df = pd.concat(tweet_df_list, ignore_index=True)

# Removes rows with no text
df = df[df['text'] != ""]

# Exports large df to csv
df.to_csv('../data/tweets/raw_bottom10_cities_2020.csv', index=False)

# Sneak peek
df

Pulled tweets from Little Rock for 2020-04-08
Pulled tweets from Little Rock for 2020-04-09
Pulled tweets from Little Rock for 2020-04-10
Pulled tweets from Little Rock for 2020-04-11
Pulled tweets from Little Rock for 2020-04-12
Pulled tweets from Little Rock for 2020-04-13
Pulled tweets from Little Rock for 2020-04-14
Pulled tweets from Little Rock for 2020-04-15
Pulled tweets from Little Rock for 2020-04-16
Pulled tweets from Little Rock for 2020-04-17
Pulled tweets from Little Rock for 2020-04-18
Pulled tweets from Little Rock for 2020-04-19
Pulled tweets from Little Rock for 2020-04-20
Pulled tweets from Little Rock for 2020-04-21
Pulled tweets from Little Rock for 2020-04-22
Pulled tweets from Little Rock for 2020-04-23
Pulled tweets from Little Rock for 2020-04-24
Pulled tweets from Little Rock for 2020-04-25
Pulled tweets from Little Rock for 2020-04-26
Pulled tweets from Little Rock for 2020-04-27
Pulled tweets from Little Rock for 2020-04-28
Pulled tweets from Little Rock for

Pulled tweets from Sioux Falls for 2020-04-10
Pulled tweets from Sioux Falls for 2020-04-11
Pulled tweets from Sioux Falls for 2020-04-12
Pulled tweets from Sioux Falls for 2020-04-13
Pulled tweets from Sioux Falls for 2020-04-14
Pulled tweets from Sioux Falls for 2020-04-15
Pulled tweets from Sioux Falls for 2020-04-16
Pulled tweets from Sioux Falls for 2020-04-17
Pulled tweets from Sioux Falls for 2020-04-18
Pulled tweets from Sioux Falls for 2020-04-19
Pulled tweets from Sioux Falls for 2020-04-20
Pulled tweets from Sioux Falls for 2020-04-21
Pulled tweets from Sioux Falls for 2020-04-22
Pulled tweets from Sioux Falls for 2020-04-23
Pulled tweets from Sioux Falls for 2020-04-24
Pulled tweets from Sioux Falls for 2020-04-25
Pulled tweets from Sioux Falls for 2020-04-26
Pulled tweets from Sioux Falls for 2020-04-27
Pulled tweets from Sioux Falls for 2020-04-28
Pulled tweets from Sioux Falls for 2020-04-29
Pulled tweets from Sioux Falls for 2020-04-30
Pulled tweets from Oklahoma City f

Unnamed: 0,text,date,city
0,Give yourself a personal loan and save your mo...,2020-04-08 23:59:44+00:00,Little Rock
1,Dinner and a jail break. #LittleCHill #Robinho...,2020-04-08 23:59:09+00:00,Little Rock
2,So who needs a haircut?,2020-04-08 23:59:02+00:00,Little Rock
3,We would choose...to build a mansion and invit...,2020-04-08 23:58:58+00:00,Little Rock
4,Thanks everyone who followed me today or retwe...,2020-04-08 23:58:35+00:00,Little Rock
...,...,...,...
183948,Well Covid 19 has ruined my life in more ways ...,2020-04-30 15:48:42+00:00,Oklahoma City
183949,Major shouts to my boy @kandrobrown. This is j...,2020-04-30 15:48:40+00:00,Oklahoma City
183950,no really. explained to mom yesterday that it’...,2020-04-30 15:46:57+00:00,Oklahoma City
183951,The homies,2020-04-30 15:46:47+00:00,Oklahoma City


### Control Group: 2019 Tweets

Next, we pulled tweets from our chosen cities for April 2019. We did this because we wanted to see how our above tweets' sentiments changed from pre-COVID-19 times. Again, it is separated into two cells because the data took several hours to pull, and we didn't want to lose anything in case something threw an error or a connection went out.
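
The "several hours" follows largely from the pause schedule: each cell makes one request per city and day and sleeps 180 seconds after every third request. The arithmetic can be sketched as below; `pause_hours` is a hypothetical helper, and it counts only time spent sleeping, not time spent downloading.

```python
def pause_hours(n_cities, n_days, n_keywords=1, every=3, pause_s=180):
    """Hours spent sleeping when pausing `pause_s` seconds every `every` requests."""
    n_requests = n_cities * n_days * n_keywords
    n_pauses = n_requests // every
    return n_pauses * pause_s / 3600


# 10 cities x 23 days = 230 requests, 76 pauses of 3 minutes each
full_pull = pause_hours(n_cities=10, n_days=23)
```

Here `full_pull` comes to 3.8 hours of pauses per ten-city cell.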

In [47]:
# 10 cities representing 10 states with most severe lockdown policies
cities = ['New York City', 'Washington, D.C.', 'Anchorage',
       'Honolulu', 'Newark', 'Providence', 'Seattle', 'Boston',
       'Manchester', 'Charleston']

# Creates lambda function that returns the day after April x
add_one = lambda x: date(2019,4,x) + timedelta(days=1)

# Sets up dates in agreed upon timeframe to use in main code below
dates = {str(date(2019,4,i)) : str(add_one(i)) for i in range(8,31)}

# Creates empty list to populate with each city's tweet dataframe
tweet_df_list = []

# Sets counter to use with timer below
count = 0

for city in cities:
    for since, until in dates.items():
        
        # Due to these cities' proximity, a smaller radius was used
        # Due to these cities' density, this should have negligible effect on data size
        if city == 'New York City' or city == 'Newark':
            # Sets location, date, language, and size criteria for collected tweets
            tweetCriteria = got.manager.TweetCriteria()\
                            .setNear(city).setWithin('5mi').setLang('en')\
                            .setSince(since).setUntil(until)\
                            .setMaxTweets(1000)
        else:
            # Sets location, date, language, and size criteria for collected tweets
            tweetCriteria = got.manager.TweetCriteria()\
                            .setNear(city).setWithin('50mi').setLang('en')\
                            .setSince(since).setUntil(until)\
                            .setMaxTweets(1000)
    
        # Uses criteria above with 'getTweets' method and saves list as 'tweets'
        tweets = got.manager.TweetManager.getTweets(tweetCriteria)
    
        # Creates list of lists with tweet text and date
        tweet_info = [[tweet.text, tweet.date] for tweet in tweets]
    
        # Converts list of lists into dataframe
        tweet_df = pd.DataFrame(tweet_info, columns=['text', 'date'])
    
        # Creates column identifying tweet collection by city name
        tweet_df['city'] = city
    
        # Appends new df to df list
        tweet_df_list.append(tweet_df)
        
        # Prints progress message
        print(f'Pulled tweets from {city} for {since}')
        
        # Counts number of requests
        count += 1
    
        # Sets 3-min timer for every 3 loops
        if count % 3 == 0:
            time.sleep(180) 

# Concatenates dfs into one large df
df = pd.concat(tweet_df_list, ignore_index=True)

# Removes rows with no text
df = df[df['text'] != ""]

# Exports large df to csv
df.to_csv('../data/tweets/raw_top10_cities_2019.csv', index=False)

# Sneak peek
df

Pulled tweets from New York City for 2019-04-08
Pulled tweets from New York City for 2019-04-09
Pulled tweets from New York City for 2019-04-10
Pulled tweets from New York City for 2019-04-11
Pulled tweets from New York City for 2019-04-12
Pulled tweets from New York City for 2019-04-13
Pulled tweets from New York City for 2019-04-14
Pulled tweets from New York City for 2019-04-15
Pulled tweets from New York City for 2019-04-16
Pulled tweets from New York City for 2019-04-17
Pulled tweets from New York City for 2019-04-18
Pulled tweets from New York City for 2019-04-19
Pulled tweets from New York City for 2019-04-20
Pulled tweets from New York City for 2019-04-21
Pulled tweets from New York City for 2019-04-22
Pulled tweets from New York City for 2019-04-23
Pulled tweets from New York City for 2019-04-24
Pulled tweets from New York City for 2019-04-25
Pulled tweets from New York City for 2019-04-26
Pulled tweets from New York City for 2019-04-27
Pulled tweets from New York City for 201

Pulled tweets from Manchester for 2019-04-09
Pulled tweets from Manchester for 2019-04-10
Pulled tweets from Manchester for 2019-04-11
Pulled tweets from Manchester for 2019-04-12
Pulled tweets from Manchester for 2019-04-13
Pulled tweets from Manchester for 2019-04-14
Pulled tweets from Manchester for 2019-04-15
Pulled tweets from Manchester for 2019-04-16
Pulled tweets from Manchester for 2019-04-17
Pulled tweets from Manchester for 2019-04-18
Pulled tweets from Manchester for 2019-04-19
Pulled tweets from Manchester for 2019-04-20
Pulled tweets from Manchester for 2019-04-21
Pulled tweets from Manchester for 2019-04-22
Pulled tweets from Manchester for 2019-04-23
Pulled tweets from Manchester for 2019-04-24
Pulled tweets from Manchester for 2019-04-25
Pulled tweets from Manchester for 2019-04-26
Pulled tweets from Manchester for 2019-04-27
Pulled tweets from Manchester for 2019-04-28
Pulled tweets from Manchester for 2019-04-29
Pulled tweets from Manchester for 2019-04-30
Pulled twe

Unnamed: 0,text,date,city
0,I have so many more ny bfs now,2019-04-08 23:59:59+00:00,New York City
1,#science as #art in @newyorkcity! such a seren...,2019-04-08 23:59:58+00:00,New York City
2,"""Beautiful and perfect"" are great synonyms for...",2019-04-08 23:59:57+00:00,New York City
3,"Close enough, I actually also mentioned values...",2019-04-08 23:59:54+00:00,New York City
4,Definitely! It’s been too long.,2019-04-08 23:59:50+00:00,New York City
...,...,...,...
211484,Is there any truth to sites around the country...,2019-04-30 00:06:18+00:00,Charleston
211485,Hi #Novel19s! My middlegrade fantasy adventure...,2019-04-30 00:06:13+00:00,Charleston
211486,I have three shots left in this polaroid big s...,2019-04-30 00:04:39+00:00,Charleston
211487,An Americans will NEVER give up their weapons ...,2019-04-30 00:03:37+00:00,Charleston


In [None]:
# 10 cities representing 10 states with least severe lockdown policies
cities = ['Little Rock', 'Jacksonville', 'Jackson', 'Houston',
       'Salt Lake City', 'Cheyenne', 'Birmingham', 'Omaha', 'Sioux Falls',
       'Oklahoma City']

# Creates lambda function that returns the day after April x
add_one = lambda x: date(2019,4,x) + timedelta(days=1)

# Sets up dates in agreed upon timeframe to use in main code below
dates = {str(date(2019,4,i)) : str(add_one(i)) for i in range(8,31)}

# Creates empty list to populate with each city's tweet dataframe
tweet_df_list = []

# Sets counter to use with timer below
count = 0

for city in cities:
    for since, until in dates.items():
        
        # Branch kept from the top-10 cell: neither city is in this list,
        # so the 50mi radius below is always used here
        if city == 'New York City' or city == 'Newark':
            # Sets location, date, language, and size criteria for collected tweets
            tweetCriteria = got.manager.TweetCriteria()\
                            .setNear(city).setWithin('5mi').setLang('en')\
                            .setSince(since).setUntil(until)\
                            .setMaxTweets(1000)
        else:
            # Sets location, date, language, and size criteria for collected tweets
            tweetCriteria = got.manager.TweetCriteria()\
                            .setNear(city).setWithin('50mi').setLang('en')\
                            .setSince(since).setUntil(until)\
                            .setMaxTweets(1000)
    
        # Uses criteria above with 'getTweets' method and saves list as 'tweets'
        tweets = got.manager.TweetManager.getTweets(tweetCriteria)
    
        # Creates list of lists with tweet text and date
        tweet_info = [[tweet.text, tweet.date] for tweet in tweets]
    
        # Converts list of lists into dataframe
        tweet_df = pd.DataFrame(tweet_info, columns=['text', 'date'])
    
        # Creates column identifying tweet collection by city name
        tweet_df['city'] = city
    
        # Appends new df to df list
        tweet_df_list.append(tweet_df)
        
        # Prints progress message
        print(f'Pulled tweets from {city} for {since}')
        
        # Counts number of requests
        count += 1
    
        # Sets 3-min timer for every 3 loops
        if count % 3 == 0:
            time.sleep(180) 

# Concatenates dfs into one large df
df = pd.concat(tweet_df_list, ignore_index=True)

# Removes rows with no text
df = df[df['text'] != ""]

# Exports large df to csv
df.to_csv('../data/tweets/raw_bottom10_cities_2019.csv', index=False)

# Sneak peek
df

### Tweets By Keywords 2020

For our final tweet pull, we ran a search through our 2020 dataset to find the most common words related to COVID-19 (we started with a list found at the link below and modified it as we saw fit). We also narrowed our focus to the cities for the top 5 and bottom 5 states and sampled every third day of the timeline. This keyword search was done in notebook 02, but we included the tweet pull here to keep our data collection code in notebook 01.

https://github.com/echen102/COVID-19-TweetIDs
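
The keyword search itself lives in notebook 02 and is not reproduced here. As a rough illustration of the idea only, the sketch below counts how many tweets mention each of the seven keywords used in the pull; the helper `keyword_counts` and the sample tweets are made up for this example.

```python
import re

import pandas as pd

# The seven keywords used in the keyword pull
KEYWORDS = ['coronavirus', 'COVID-19', 'lockdown',
            'pandemic', 'mask', 'quarantine', 'distancing']


def keyword_counts(texts, keywords=KEYWORDS):
    """Number of tweets mentioning each keyword (case-insensitive substring match)."""
    texts = pd.Series(texts, dtype='object').fillna('')
    counts = {kw: int(texts.str.contains(re.escape(kw), case=False).sum())
              for kw in keywords}
    return pd.Series(counts).sort_values(ascending=False)


# Made-up sample tweets
sample = ['Wear a mask!', 'covid-19 lockdown day 30',
          'nice weather', 'Masks and distancing']
sample_counts = keyword_counts(sample)
```

Substring matching means "Masks" counts toward "mask"; a stricter variant could match on word boundaries instead.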

In [51]:
# 5 cities representing 5 states with most severe lockdown policies
cities = ['New York City', 'Washington, D.C.', 'Anchorage',
       'Honolulu', 'Newark']

# Creates lambda function that returns the day after April x
add_one = lambda x: date(2020,4,x) + timedelta(days=1)

# Samples every third day (April 8, 11, ..., 29) of agreed upon timeframe to use in main code below
dates = {str(date(2020,4,i)) : str(add_one(i)) for i in range(8,31,3)}

# Creates empty list to populate with each city's tweet dataframe
tweet_df_list = []
count = 0
for city in cities:
    for since, until in dates.items(): 
        for keyword in ['coronavirus', 'COVID-19', 'lockdown', 
                        'pandemic', 'mask', 'quarantine', 'distancing']:
        
            # Due to these cities' proximity, a smaller radius was used
            # Due to these cities' density, this should have negligible effect on data size
            if city == 'New York City' or city == 'Newark':
                # Sets location, date, language, and size criteria for collected tweets
                tweetCriteria = got.manager.TweetCriteria().setQuerySearch(keyword)\
                            .setNear(city).setWithin('5mi').setLang('en')\
                            .setSince(since).setUntil(until)\
                            .setMaxTweets(1000)
            else:
                # Sets location, date, language, and size criteria for collected tweets
                tweetCriteria = got.manager.TweetCriteria().setQuerySearch(keyword)\
                            .setNear(city).setWithin('50mi').setLang('en')\
                            .setSince(since).setUntil(until)\
                            .setMaxTweets(1000)
    
            # Uses criteria above with 'getTweets' method and saves list as 'tweets'
            tweets = got.manager.TweetManager.getTweets(tweetCriteria)
    
            # Creates list of lists with tweet text and date
            tweet_info = [[tweet.text, tweet.date] for tweet in tweets]
    
            # Converts list of lists into dataframe
            tweet_df = pd.DataFrame(tweet_info, columns=['text', 'date'])
    
            # Creates column identifying tweet collection by city name
            tweet_df['city'] = city
    
            # Appends new df to df list
            tweet_df_list.append(tweet_df)
        
            # Keeps track of where in the for loops you are
            print(f'Pulled {keyword} tweets from {city} for {since}')
        
            count += 1
    
            # Sets 3-min timer for every 3 loops
            if count % 3 == 0:
                time.sleep(180) 

# Concatenates dfs into one master df
df = pd.concat(tweet_df_list, ignore_index=True)

# Removes rows with no text
df = df[df['text'] != ""]

# Exports tweets df so we can skip above process if we want to change anything
df.to_csv('../data/tweets/raw_top5_cities_keywords.csv', index=False)

df

Pulled coronavirus tweets from New York City for 2020-04-08
Pulled COVID-19 tweets from New York City for 2020-04-08
Pulled lockdown tweets from New York City for 2020-04-08
Pulled pandemic tweets from New York City for 2020-04-08
Pulled mask tweets from New York City for 2020-04-08
Pulled quarantine tweets from New York City for 2020-04-08
Pulled distancing tweets from New York City for 2020-04-08
Pulled coronavirus tweets from New York City for 2020-04-11
Pulled COVID-19 tweets from New York City for 2020-04-11
Pulled lockdown tweets from New York City for 2020-04-11
Pulled pandemic tweets from New York City for 2020-04-11
Pulled mask tweets from New York City for 2020-04-11
Pulled quarantine tweets from New York City for 2020-04-11
Pulled distancing tweets from New York City for 2020-04-11
Pulled coronavirus tweets from New York City for 2020-04-14
Pulled COVID-19 tweets from New York City for 2020-04-14
Pulled lockdown tweets from New York City for 2020-04-14
Pulled pandemic tweets

Pulled lockdown tweets from Anchorage for 2020-04-20
Pulled pandemic tweets from Anchorage for 2020-04-20
Pulled mask tweets from Anchorage for 2020-04-20
Pulled quarantine tweets from Anchorage for 2020-04-20
Pulled distancing tweets from Anchorage for 2020-04-20
Pulled coronavirus tweets from Anchorage for 2020-04-23
Pulled COVID-19 tweets from Anchorage for 2020-04-23
Pulled lockdown tweets from Anchorage for 2020-04-23
Pulled pandemic tweets from Anchorage for 2020-04-23
Pulled mask tweets from Anchorage for 2020-04-23
Pulled quarantine tweets from Anchorage for 2020-04-23
Pulled distancing tweets from Anchorage for 2020-04-23
Pulled coronavirus tweets from Anchorage for 2020-04-26
Pulled COVID-19 tweets from Anchorage for 2020-04-26
Pulled lockdown tweets from Anchorage for 2020-04-26
Pulled pandemic tweets from Anchorage for 2020-04-26
Pulled mask tweets from Anchorage for 2020-04-26
Pulled quarantine tweets from Anchorage for 2020-04-26
Pulled distancing tweets from Anchorage fo

Unnamed: 0,text,date,city
0,They're here #angels #coronavirus #usa #nyc #t...,2020-04-08 23:53:12+00:00,New York City
1,Corona virus snapped,2020-04-08 23:50:20+00:00,New York City
2,.@Elaine_Quijano wrapping up tonight with anal...,2020-04-08 23:41:37+00:00,New York City
3,That’s a great question! How are they diagnosi...,2020-04-08 23:40:47+00:00,New York City
4,NPR: Coronavirus Fears Shouldn't Keep You From...,2020-04-08 23:40:10+00:00,New York City
...,...,...,...
34419,"Emily, you, @SumedhMankarDO, @ricardoyoung02, ...",2020-04-29 00:54:24+00:00,Newark
34420,"It was a social distancing, PPE distribution, ...",2020-04-29 00:48:30+00:00,Newark
34421,I'm tired of seeing all these ig comedians not...,2020-04-29 00:38:48+00:00,Newark
34422,Me too. We are still under social distancing l...,2020-04-29 00:36:36+00:00,Newark


In [None]:
# 5 cities representing 5 states with least severe lockdown policies
cities = ['Cheyenne', 'Birmingham', 'Omaha',
       'Sioux Falls', 'Oklahoma City']

# Creates lambda function that returns the day after April x
add_one = lambda x: date(2020,4,x) + timedelta(days=1)

# Samples every third day (April 8, 11, ..., 29) of agreed upon timeframe to use in main code below
dates = {str(date(2020,4,i)) : str(add_one(i)) for i in range(8,31,3)}

# Creates empty list to populate with each city's tweet dataframe
tweet_df_list = []
count = 0
for city in cities:
    for since, until in dates.items(): 
        for keyword in ['coronavirus', 'COVID-19', 'lockdown', 
                        'pandemic', 'mask', 'quarantine', 'distancing']:
        
            # Branch kept from the top-5 cell: neither city is in this list,
            # so the 50mi radius below is always used here
            if city == 'New York City' or city == 'Newark':
                # Sets location, date, language, and size criteria for collected tweets
                tweetCriteria = got.manager.TweetCriteria().setQuerySearch(keyword)\
                            .setNear(city).setWithin('5mi').setLang('en')\
                            .setSince(since).setUntil(until)\
                            .setMaxTweets(1000)
            else:
                # Sets location, date, language, and size criteria for collected tweets
                tweetCriteria = got.manager.TweetCriteria().setQuerySearch(keyword)\
                            .setNear(city).setWithin('50mi').setLang('en')\
                            .setSince(since).setUntil(until)\
                            .setMaxTweets(1000)
    
            # Uses criteria above with 'getTweets' method and saves list as 'tweets'
            tweets = got.manager.TweetManager.getTweets(tweetCriteria)
    
            # Creates list of lists with tweet text and date
            tweet_info = [[tweet.text, tweet.date] for tweet in tweets]
    
            # Converts list of lists into dataframe
            tweet_df = pd.DataFrame(tweet_info, columns=['text', 'date'])
    
            # Creates column identifying tweet collection by city name
            tweet_df['city'] = city
    
            # Appends new df to df list
            tweet_df_list.append(tweet_df)
        
            # Keeps track of where in the for loops you are
            print(f'Pulled {keyword} tweets from {city} for {since}')
        
            count += 1
    
            # Sets 3-min timer for every 3 loops
            if count % 3 == 0:
                time.sleep(180) 

# Concatenates dfs into one master df
df = pd.concat(tweet_df_list, ignore_index=True)

# Removes rows with no text
df = df[df['text'] != ""]

# Exports tweets df so we can skip above process if we want to change anything
df.to_csv('../data/tweets/raw_bottom5_cities_keywords.csv', index=False)

df

### COVID-19 Data by State/Date

Finally, we pulled data from the two sources linked above and ordered them into dataframes so we could use them later in our analysis.

In [12]:
# Reads in downloaded infection data
cov_infect = pd.read_csv('../data/health/Hospitalization_all_locs.csv')

# Creates list of states so we only use the data we need
states = ['Alaska', 'Arkansas', 'Oklahoma', 'Alabama', 'District of Columbia', 'New York',
         'Hawaii', 'Mississippi', 'Texas', 'Florida', 'Nebraska', 'Utah', 'Wyoming',
         'South Dakota', 'New Jersey', 'Rhode Island', 'Massachusetts', 'Washington',
         'New Hampshire', 'West Virginia', 'Washington, D.C.']

# Weeds out states not used in our project using list above
cov_infect = cov_infect[cov_infect['location_name'].isin(states)]

# Selects only the dates that we are focusing on for our project
cov_infect = cov_infect[(cov_infect['date'] >= '2020-04-08') & (cov_infect['date'] <= '2020-04-30')]

# Saves only the columns we need for our project (copied so the edits below don't act on a slice)
cov_infect = cov_infect[['location_name', 'date', 'confirmed_infections']].copy()

# Renames location column
cov_infect.rename(columns={'location_name':'state'}, inplace=True)

# Flags whether a state is in the top 10 or bottom 10 aggressive response list
top_10 = ['New York', 'District of Columbia', 'Alaska', 'Hawaii',
       'New Jersey', 'Rhode Island', 'Washington', 'Massachusetts',
       'New Hampshire', 'West Virginia']
cov_infect['top_10'] = np.where(cov_infect['state'].isin(top_10), 1, 0)

cov_infect

Unnamed: 0,state,date,confirmed_infections,top_10
757,Alabama,2020-04-08,159.0,0
758,Alabama,2020-04-09,375.0,0
759,Alabama,2020-04-10,244.0,0
760,Alabama,2020-04-11,270.0,0
761,Alabama,2020-04-12,346.0,0
...,...,...,...,...
51678,Wyoming,2020-04-26,11.0,0
51679,Wyoming,2020-04-27,18.0,0
51680,Wyoming,2020-04-28,16.0,0
51681,Wyoming,2020-04-29,9.0,0


In [13]:
# Saves infections dataframe as csv
cov_infect.to_csv('../data/health/infections.csv', index=False)

In [14]:
# Reads in downloaded death count data
cov_deaths = pd.read_csv('../data/health/Provisional_COVID-19_Death_Counts_by_Week_Ending_Date_and_State.csv')

# Weeds out states not used in our project using list above (copied so the edits below don't act on a slice)
cov_deaths = cov_deaths[cov_deaths['State'].isin(states)].copy()

# Converts date column to datetime since it was formatted differently than other datasets
cov_deaths['End Week'] = pd.to_datetime(cov_deaths['End Week'], utc=True)

# Selects only the dates that we are focusing on for our project
cov_deaths = cov_deaths[(cov_deaths['End Week'] >= '2020-04-04') & (cov_deaths['End Week'] <= '2020-05-02')]

# Removes timestamp from data
cov_deaths['End Week'] = cov_deaths['End Week'].dt.date

# Saves only the columns we need for our project
cov_deaths = cov_deaths[['End Week', 'State', 'COVID-19 Deaths']]

# Renames columns
cov_deaths.rename(columns={'State':'state', 'End Week': 'date', 'COVID-19 Deaths': 'cov_deaths'}, inplace=True)

# Fills null values; reassigned so the fill is kept in the exported csv
cov_deaths = cov_deaths.fillna(0)

# Displays result
cov_deaths

Unnamed: 0,date,state,cov_deaths
29,2020-04-04,Alabama,49.0
30,2020-04-11,Alabama,80.0
31,2020-04-18,Alabama,88.0
32,2020-04-25,Alabama,80.0
33,2020-05-02,Alabama,75.0
...,...,...,...
1049,2020-04-04,Wyoming,0.0
1050,2020-04-11,Wyoming,0.0
1051,2020-04-18,Wyoming,0.0
1052,2020-04-25,Wyoming,0.0


In [15]:
# Saves deaths dataframe as csv
cov_deaths.to_csv('../data/health/deaths.csv', index=False)
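
Because the infection data is daily and the death data is weekly, comparing the two later requires putting them on a common time step. One option, sketched below under the assumption that weeks end on Saturday (the week-ending dates shown above, such as 2020-04-04 and 2020-04-11, are Saturdays), is to sum daily infections into weekly totals. The helper `weekly_infections` is illustrative and not part of the original analysis; the toy rows reuse the Alabama values shown earlier.

```python
import pandas as pd


def weekly_infections(daily):
    """Sum daily 'confirmed_infections' per state into weeks ending on Saturday."""
    daily = daily.copy()
    daily['date'] = pd.to_datetime(daily['date'])
    weekly = (daily
              .groupby(['state', pd.Grouper(key='date', freq='W-SAT')])
              ['confirmed_infections'].sum()
              .reset_index())
    return weekly


# Toy rows shaped like infections.csv (state, date, confirmed_infections)
toy_infections = pd.DataFrame({
    'state': ['Alabama'] * 5,
    'date': ['2020-04-08', '2020-04-09', '2020-04-10',
             '2020-04-11', '2020-04-12'],
    'confirmed_infections': [159.0, 375.0, 244.0, 270.0, 346.0],
})
weekly = weekly_infections(toy_infections)
```

Note that the first and last weeks of the April 8-30 window are partial, so their totals are not directly comparable with full weeks.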