# Part 1: Data Collection

### Overview:
This notebook defines basic functions to parse the COVID-19 GitHub repo, read Tweet IDs from a text file, fetch tweet JSON for a Tweet ID, and so on. It also performs the initial COVID-19+Transportation tasks: parsing our keywords dataset and defining functions to fetch COVID-19+Transportation tweets. These functions form the basis of the `general-collector.py` script, which we use to collect most of our data from the COVID-19 repo.

### Table of Contents
1. [Parse COVID-19 Dataset](#1)
    1. [Access the COVID-19 dataset files](#a)
    2. [Get Tweet IDs from a text file](#b)
2. [Parse Tweet IDs for text body](#2)
    1. [Set up Twitter API keys and initialize necessary libraries](#2a)
    2. ['Transform' keywords csv file and define functions for keyword searching](#2b)
    3. [Save the date range of the transportation tweets saved in each file](#2c)
    4. [Define functions to fetch and save COVID-19+Transportation data](#2d)
3. [Finally, call the Twitter API](#3) 
4. [Dehydrate `jsonl` files back to `txt` file of tweet IDs](#4)
5. [Update: Data Collection through `general-collector.py` (4/7/2022)](#5)

### My data collection involves a three-step process:
1. Access the tweet ID text files from the [COVID-19 dataset](https://github.com/echen102/COVID-19-TweetIDs).
2. Combine tweet ID text files into larger documents to "rehydrate" the tweet IDs into their full JSON form.
3. Filter COVID tweets by keywords related to transportation.


Goal: create a training dataset. There are no corresponding labels, since the analysis is unsupervised.

## (1) Parse COVID-19 Dataset <a href name id="1">

#### Key Events
Using the [CDC COVID-19 Timeline](https://www.cdc.gov/museum/timeline/covid19.html), we focus our data collection on key time periods in which tweets related to COVID-19 intersect with phrases related to transportation.

Origins: December 2019 - January 2020
- December 12, 2019: Patients in Wuhan begin experiencing symptoms of what later becomes known as COVID-19.
- January 20, 2020: CDC confirms the first U.S. laboratory-confirmed case of COVID-19 in the U.S. from samples taken on January 18 in Washington state.

Early Pandemic: February 2020 - May 2020
- February 23, 2020: Italy becomes a global COVID-19 hotspot. 
- [February 26, 2020](https://www.cdc.gov/media/releases/2020/t0225-cdc-telebriefing-covid-19.html): CDC’s Dr. Nancy Messonnier, Incident Manager for the COVID-19 Response, holds a telebriefing. During the telebriefing she braces the U.S. for the eventual community spread of the novel coronavirus and states that the “disruption to everyday life may be severe.”
- March 11, 2020: The World Health Organization declares COVID-19 a pandemic.
- March 13, 2020: President Donald J. Trump declares a nationwide emergency.
- March 15, 2020: U.S. states begin to shut down to prevent the spread of COVID-19. New York City public schools system (the largest school system in the U.S., with 1.1 million students) shuts down, while Ohio calls for restaurants and bars to close.
- March 28, 2020: White House extends social distancing measures until the end of April 2020.
- April 3, 2020: At a White House press briefing, CDC announces new mask wearing guidelines and recommends that all people wear a mask when outside of the home.
- May 2, 2020: World Health Organization renews its emergency declaration from three months prior calling the pandemic a global health crisis.
- May 8, 2020: News media outlets report that top White House officials shelve CDC “Guidance for Implementing the Opening Up America Again Framework” that include detailed advice on how to safely reopen the country.
- July 23, 2020: CDC releases new science-based resources and tools for school administrators, teachers, parents, guardians, and caregivers for safe school reopening.

COVID Vaccines Rollout: December 2020 - April 2021
- December 11, 2020: Food and Drug Administration issues an Emergency Use Authorization (EUA) for the first COVID-19 vaccine – the Pfizer-BioNTech COVID-19 vaccine.
- December 24, 2020: It is estimated that more than 1 million people in the U.S. are vaccinated against COVID-19.
- March 8, 2021: CDC announces that fully vaccinated people can gather indoors without masks.
- April 2, 2021: CDC announces fully vaccinated individuals can travel safely domestically in the U.S. without a COVID test first.
- July 27, 2021: After a substantial upswing in cases due to the Delta variant, CDC releases updated guidance for everyone in areas with substantial or high transmission to wear a mask while indoors.
- November 29, 2021: CDC recommends that everyone over 18 years old who received a Pfizer or Moderna vaccine receive a COVID-19 booster shot 6 months after they are fully vaccinated.

- **March 10, 2022: At CDC’s recommendation, TSA extends the security directive for mask use on public transportation and transportation hubs for one month, through April 18th.**




In [1]:
import pandas as pd
import requests
import io
import os
import json
import datetime

# twitter API-related
import tweepy
from tqdm import tqdm
from twarc import Twarc

### Access the COVID-19 dataset files <a href name id="a">
According to the COVID-19 dataset documentation, the Tweet-IDs are organized as follows:
- Tweet-ID files are stored in folders that indicate the year and month of the collection (YEAR-MONTH).
- Individual Tweet-ID files contain a collection of Tweet IDs, and the file names all follow the same structure, with a prefix “coronavirus-tweet-id-” followed by the YEAR-MONTH-DATE-HOUR.

The COVID-19 Tweet IDs are uploaded from 0:00-23:00, representing each hour in the day. Since some files may be missing (several hours in the day did not upload), we iterate over the files using the `YEAR-MONTH-DATE-HOUR` pattern, ignoring any URLs in which there is a `404 Not Found` Error.

In [2]:
# get outer directories
base_url = 'https://github.com/echen102/COVID-19-TweetIDs/tree/master/'
time_range = ['2020-01', '2020-02', '2020-03', '2020-04', '2020-05', '2020-06', '2020-07', '2020-08', '2020-09', '2020-10', '2020-11', '2020-12', 
            '2021-01', '2021-02', '2021-03', '2021-04', '2021-05', '2021-06', '2021-07', '2021-08', '2021-09', '2021-10', '2021-11', '2021-12', 
            '2022-01', '2022-02', '2022-03', '2022-04']

# show the Tweet-ID `YEAR-MONTH` folders
for yyyyMM in time_range:
    print(base_url+yyyyMM)

https://github.com/echen102/COVID-19-TweetIDs/tree/master/2020-01
https://github.com/echen102/COVID-19-TweetIDs/tree/master/2020-02
https://github.com/echen102/COVID-19-TweetIDs/tree/master/2020-03
https://github.com/echen102/COVID-19-TweetIDs/tree/master/2020-04
https://github.com/echen102/COVID-19-TweetIDs/tree/master/2020-05
https://github.com/echen102/COVID-19-TweetIDs/tree/master/2020-06
https://github.com/echen102/COVID-19-TweetIDs/tree/master/2020-07
https://github.com/echen102/COVID-19-TweetIDs/tree/master/2020-08
https://github.com/echen102/COVID-19-TweetIDs/tree/master/2020-09
https://github.com/echen102/COVID-19-TweetIDs/tree/master/2020-10
https://github.com/echen102/COVID-19-TweetIDs/tree/master/2020-11
https://github.com/echen102/COVID-19-TweetIDs/tree/master/2020-12
https://github.com/echen102/COVID-19-TweetIDs/tree/master/2021-01
https://github.com/echen102/COVID-19-TweetIDs/tree/master/2021-02
https://github.com/echen102/COVID-19-TweetIDs/tree/master/2021-03
https://gi

In [5]:
# selected_time_range = ['2020-01', '2020-02']

def fetchAllTextFileURLs(time_range):
    files = []
    base_url = 'https://raw.githubusercontent.com/echen102/COVID-19-TweetIDs/master/'
    prefix = 'coronavirus-tweet-id'
    dates = range(1, 32) # day 1 to 31
    hours = range(24) # hour 0 to 23
    
    for yyyyMM in time_range:
        for date in dates:
            for hour in hours:
                # zero-pad day and hour to two digits, e.g. 2020-01-05-07
                fileURL = f'{base_url}{yyyyMM}/{prefix}-{yyyyMM}-{date:02d}-{hour:02d}.txt'
                files.append(fileURL)

                # print(fileURL)
    
    return files

# testFiles = fetchAllTextFileURLs(selected_time_range)
testFiles = fetchAllTextFileURLs(time_range)

print('Total files in Time Range:', len(testFiles))

Total files in Time Range: 20832
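
Because the generator above emits days 1 to 31 for every month, some of the 20832 URLs (February 30, for example) can never exist. Below is a minimal sketch of an alternative that only emits real calendar dates. The name `fetchValidTextFileURLs` is hypothetical and not part of the original notebook. Note that the indices used elsewhere in this notebook assume 744 entries per month, so they would not carry over to a list built this way.

```python
import calendar

def fetchValidTextFileURLs(time_range):
    """Hypothetical variant of fetchAllTextFileURLs that skips impossible dates."""
    base_url = 'https://raw.githubusercontent.com/echen102/COVID-19-TweetIDs/master/'
    prefix = 'coronavirus-tweet-id'
    files = []
    for yyyyMM in time_range:
        year, month = (int(part) for part in yyyyMM.split('-'))
        # monthrange returns (weekday of the 1st, number of days in the month)
        days_in_month = calendar.monthrange(year, month)[1]
        for day in range(1, days_in_month + 1):
            for hour in range(24):
                files.append(f'{base_url}{yyyyMM}/{prefix}-{yyyyMM}-{day:02d}-{hour:02d}.txt')
    return files

feb2020 = fetchValidTextFileURLs(['2020-02'])
print(len(feb2020))   # 29 days * 24 hours in a leap-year February
print(feb2020[-1])
```

Missing hours still have to be handled by the `404` check, since this only removes dates that cannot exist.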


### Get Tweet IDs from a text file. <a href name id="b">
The function above generates a URL for every date and hour between January 1, 2020 and the end of the time range (April 2022 if we set the range as 1/2020 to 4/2022), including dates that do not exist, since it always iterates over days 1 to 31. In addition, the COVID-19 dataset's collection did not begin until January 21, 2020, and known collection gaps have occurred since then, so we cannot assume that every generated URL is valid.

Thus, for each URL generated, we call `getTweetIDsFrom(txtFile)` to check whether the file exists. If the request returns a `404 Not Found` error (or any other non-200 status), we ignore it. Otherwise, we read the `.txt` file to retrieve all the tweet IDs.

In [6]:
def getTweetIDsFrom(txtFile):
    '''
    Retrieves tweet IDs from the text file's URL and stores in list
    '''
    tweetList = []
    response = requests.get(txtFile) # requests raw file from Github
    
    if response.status_code == 200:
        content = response.content.decode('utf-8')
        tweetList = content.split()
        
    return tweetList

In [7]:
# sample URLs for testing
urlJan2020 = testFiles[535]
urlFeb2020 = testFiles[1000]

print(urlJan2020)
tweetsJan2020 = getTweetIDsFrom(urlJan2020)
print(urlFeb2020)
tweetsFeb2020 = getTweetIDsFrom(urlFeb2020)

print("January 23, 2020, at 7am :", len(tweetsJan2020))
print("February 11, 2020, at 4pm:", len(tweetsFeb2020))

https://raw.githubusercontent.com/echen102/COVID-19-TweetIDs/master/2020-01/coronavirus-tweet-id-2020-01-23-07.txt
https://raw.githubusercontent.com/echen102/COVID-19-TweetIDs/master/2020-02/coronavirus-tweet-id-2020-02-11-16.txt
January 23, 2020, at 7am : 2006
February 11, 2020, at 4pm: 41281


From our two tests, we see that the hourly volume of tweets increased sharply as awareness of COVID-19 spread. In late January 2020, the sampled hour contains only ~2,000 tweets about COVID-19, while by mid-February 2020 the sampled hour contains ~41,000.

## (2) Parse Tweet IDs for text body. <a href name id='2'>

We are now ready to begin our search for transportation data within the COVID-19 dataset. We created a "keywords" list of possible phrases and words related to transportation and mobility. The list is not exhaustive, but it deliberately includes some phrases that are only loosely related to our analysis so that the dataset is as robust as possible.

`Twarc` is a Python library that collects and archives Twitter data via the Twitter API. The benefit of `Twarc` is that it provides two methods: `hydrate`, which generates tweet JSON from a `.txt` file of tweet IDs, and `dehydrate`, which generates tweet IDs from a file of tweets.

In our analysis, we first generate JSON files of `COVID-19+transportation` tweets by iterating through the COVID-19 dataset. After generating JSON files capped at ~1000 tweets each, we use Twarc to `dehydrate` the tweets into a compact `.txt` file.

### Set up Twitter API keys and initialize necessary libraries <a href name id="2a">

In [8]:
# ****Set up to access Twitter API****

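# Read developer keys from environment variables rather than hardcoding them in the notebook.
# The variable names below are assumptions; use whatever names you export in your shell.
consumer_key = os.environ['TWITTER_CONSUMER_KEY']
consumer_secret = os.environ['TWITTER_CONSUMER_SECRET']
access_token = os.environ['TWITTER_ACCESS_TOKEN']
access_token_secret = os.environ['TWITTER_ACCESS_TOKEN_SECRET']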
  
# authorization of consumer key and consumer secret
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
  
# set access to user's access key and access secret 
auth.set_access_token(access_token, access_token_secret)
  
# calling the api 
api = tweepy.API(auth)

### 'Transform' keywords csv file and define functions for keyword searching <a href name id='2b'> 

In [11]:
# Transform keywords list

keywords = pd.read_csv('./cs315_keywords.csv')
keywords.head()
keywords['public transport']=keywords['public transport'].str.lower()
keywords['motorized']=keywords['motorized'].str.lower()
keywords['non-motorized']=keywords['non-motorized'].str.lower()

transit = keywords['public transport'].tolist()
motorized = keywords['motorized'].tolist()
nonmotorized = keywords['non-motorized'].tolist()

keywords = set(transit + motorized + nonmotorized)

print(keywords)
print(len(keywords))

{nan, 'red line', 'buick', 'bike', 'chevrolet', 'monorail', 'autos', 'mass transit', 'rail', 'walks', 'automobiles', 'light rail', 'traffic', 'chrysler', 'jog', 'ford', 'muni', 'road', 'electric scooter', 'fiat', 'nissan', 'railroads', 'streetcars', 'mazda', 'vehicles', 'gasoline', 'silver line', 'cars', 'railway', 'scooters', 'kia', 'bart', 'bicycling', 'caltrain', 'jeep', 'lexus', 'rollerskate', 'commuter', 'route', 'rideshare', 'run', 'rollerblading', 'mta', 'roads', 'transportation', 'uber', 'metropolitan', 'hyundai', 'streetcar', 'railroad', 'commute', 'bus', 'walking', 'shuttle', 'amtrak', 'e-scooter', 'auto', 'bicycles', 'diesel', 'cable car', 'toyota', 'jogging', 'blue line', 'station', 'mass transportation', 'congestion', 'rides', 'honda', 'trolley', 'bicycle', 'engine', 'lincoln', 'commuter rail', 'highway', 'subaru', 'roadway', 'gas prices', 'lane', 'carpooling', 'workout', 'carpool', 'green line', 'bikers', 'walkers', 'running', 'exercise', 'transport', 'subway', 'rapid tra

In [10]:
def containsKeyword(text, keywords):
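    '''
    Checks whether any whitespace-separated word in the tweet body is in the keywords set.
    Note: multi-word keywords (e.g. 'red line') can never match a single word.
    '''
    loweredText = [word.lower() for word in text.split()]
    textSet = set(loweredText)

    # a non-empty intersection means at least one keyword appears in the tweet
    return bool(textSet & keywords)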

print(containsKeyword("hi there car subaru drive", keywords))
print(containsKeyword("Creating a Grocery List Manager Using Angular, Part 1: Add &amp; Display Items https://t.co/xFox78juL1 #Angular", keywords))

True
False
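
`containsKeyword` compares whole whitespace-separated words, so multi-word keywords such as 'red line' can never match, and a word followed by punctuation ("bus!") is missed as well. The keyword set printed above also contains a `nan` entry from empty CSV cells. A minimal sketch of a phrase-aware variant follows; `cleanKeywords` and `containsKeywordOrPhrase` are hypothetical names and not part of the original collection pipeline.

```python
import re

def cleanKeywords(rawKeywords):
    """Lowercase keywords and drop missing values such as NaN (anything that is not a string)."""
    return {kw.strip().lower() for kw in rawKeywords if isinstance(kw, str) and kw.strip()}

def containsKeywordOrPhrase(text, keywords):
    """
    Hypothetical variant of containsKeyword that also matches multi-word
    keywords and ignores punctuation around words.
    """
    # normalize the tweet to lowercase words separated by single spaces
    words = re.findall(r"[a-z0-9'-]+", text.lower())
    normalized = ' ' + ' '.join(words) + ' '
    # pad with spaces so 'bus' matches the word 'bus' but not 'business'
    return any(f' {kw} ' in normalized for kw in keywords)

sampleKeywords = cleanKeywords(['Red Line', 'bus', float('nan'), 'e-scooter'])
print(containsKeywordOrPhrase('Took the red line downtown today.', sampleKeywords))
print(containsKeywordOrPhrase('Small business owners are struggling', sampleKeywords))
print(containsKeywordOrPhrase('Masks required on the bus!', sampleKeywords))
```

This scans every keyword for every tweet, which is slower than a set intersection; for a keyword list of this size that trade-off is likely acceptable, but it is a design choice rather than a requirement.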


### Save the date range of the transportation tweets saved in each file <a href name id="2c">

First, we define functions to standardize the file names. Each `.jsonl` file is prefixed with `covid-mobility-tweet-starting`, followed by the date of the first tweet in the file (month and day are not zero-padded) and an index `i` that distinguishes multiple files starting on the same date.

Template: `covid-mobility-tweet-starting-YYYY-M-D-i.jsonl`

Example: `covid-mobility-tweet-starting-2020-5-1-0.jsonl`
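
As a small illustration of how a tweet's `created_at` timestamp maps to a file name, here is a minimal sketch that produces a zero-padded date, which sorts chronologically in a directory listing. The helper `paddedFileName` is hypothetical; the notebook's own `formatFileName` uses unpadded month and day values.

```python
import datetime

def paddedFileName(createdAt, index=0):
    """Hypothetical helper: zero-padded file name from a tweet's created_at string."""
    parsed = datetime.datetime.strptime(createdAt, '%a %b %d %H:%M:%S %z %Y')
    return f"covid-mobility-tweet-starting-{parsed.strftime('%Y-%m-%d')}-{index}.jsonl"

print(paddedFileName('Fri May 01 00:11:52 +0000 2020'))
```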


In [18]:
# test datetime object format
date_time_str = "Wed Oct 10 20:19:24 +0000 2018"
date_time_obj = datetime.datetime.strptime(date_time_str, '%a %b %d %H:%M:%S %z %Y')
time = date_time_obj.strftime("%H:%M:%S")
print("***TESTS***")
print("Year Month Day", date_time_obj.year, date_time_obj.month, date_time_obj.day, time)

# note: I removed some keys for brevity
sampleTweet = {
    'created_at': 'Fri May 01 00:11:52 +0000 2020',
 'id': 1256013431855108096,
 'id_str': '1256013431855108096',
 'text': 'RT @VEJA: Covid-19 avança nas favelas — mas ruas continuam cheias com gente que não pode parar e se proteger https://t.co/rPk8eequN5 https:…',
 'truncated': False,
 'entities': {'hashtags': [],
  'symbols': [],
  'user_mentions': [{'screen_name': 'VEJA',
    'name': 'VEJA',
    'id': 17715048,
    'id_str': '17715048',
    'indices': [3, 8]}],
  'urls': [{'url': 'https://t.co/rPk8eequN5',
    'expanded_url': 'https://veja.abril.com.br/brasil/o-coronavirus-chega-a-favela/',
    'display_url': 'veja.abril.com.br/brasil/o-coron…',
    'indices': [109, 132]}]},
 'source': '<a href="https://mobile.twitter.com" rel="nofollow">Twitter Web App</a>',
 'geo': None,
 'coordinates': None,
 'place': None,
 'contributors': None,
}

# Test helper functions
print("Tweet File Name: ", formatFileName([sampleTweet]))
print(fetchStartAndEndDate([sampleTweet], 1))

***TESTS***
Year Month Day 2018 10 10 20:19:24
Tweet File Name:  covid-mobility-tweet-starting-2020-5-1-0.jsonl
Date Range: 2020-5-1-0 00:11:52 to 2020-5-1 00:11:52


In [19]:
# helper functions used for saveTransportationTweets()
def fetchStartAndEndDate(tweetArr, count):
    '''
    Returns a string describing the date range of the batch, built from the
    created_at timestamps of the first tweet and the tweet at index count-1.
    '''
    startTimeStr = tweetArr[0]["created_at"]
    s = datetime.datetime.strptime(startTimeStr, '%a %b %d %H:%M:%S %z %Y')
    sTime = s.strftime("%H:%M:%S")
    # print("s",sTime)
    endTimeStr = tweetArr[count-1]["created_at"]
    e = datetime.datetime.strptime(endTimeStr, '%a %b %d %H:%M:%S %z %Y')
    eTime = e.strftime("%H:%M:%S")
    # print("e",eTime)

    dateRange = f'Date Range: {s.year}-{s.month}-{s.day}-{s.hour} {sTime} to {e.year}-{e.month}-{e.day} {eTime}'
    
    return dateRange 

def formatFileName(tweetArr):
    '''
    Formats the name of the .jsonl file from the date of the first tweet in the batch,
    using the prefix "covid-mobility-tweet-starting" followed by YEAR-MONTH-DAY-i,
    where i is the first index not already used for that date.
    '''
    startTimeStr = tweetArr[0]["created_at"]
    s = datetime.datetime.strptime(startTimeStr, '%a %b %d %H:%M:%S %z %Y')

    base_path = f'covid-mobility-tweet-starting-{s.year}-{s.month}-{s.day}'
    out_dir = './covid-mobility-tweets/' # same folder that writeToJsonlFile() writes to

    # find the first `base_path-i` that does not exist yet, so earlier files are not overwritten
    i = 0
    while os.path.exists(f'{out_dir}{base_path}-{i}.jsonl'):
        i += 1

    return f'{base_path}-{i}.jsonl'
        

def writeToJsonlFile(tweetArr, dateRange, filename):
    '''
    Writes a list of tweets to a given .jsonl file. Note that .jsonl and .json differ in structure.
    '''
    filepath = './covid-mobility-tweets/'+filename
    with open(filepath, 'w') as outfile:
        outfile.write(f'{{{dateRange}}}\n') # header line with the date range wrapped in braces (note: not valid JSON)
        for entry in tweetArr:
            json.dump(entry, outfile)
            outfile.write('\n')

formatFileName([sampleTweet])

'covid-mobility-tweet-starting-2020-5-1-0.jsonl'

The next function makes up the core of the data collection. Given an array of tweet IDs, `saveTransportationTweets()` fetches each tweet and adds those matching our transportation keywords to a transportation tweet array. When the size of that array reaches 1000, we save it as a `.jsonl` file, reset the array, and repeat the process.

### Define functions to fetch and save `COVID-19+Transportation` data. <a href name id='2d'>

In [20]:
def saveTransportationTweets(tweetIDs, keywords):
    '''
    Iterates through list of tweets and identifies tweets containing keywords. 
    
    When the number of transportation tweets reaches 1000, save the current tweets 
    as a .jsonl file and reset the batch.
    '''
    tweetArr = []
    count = 0
    for tweet in tweetIDs:
        # save json file and create new one
        if count == 1000:
            dateRange = fetchStartAndEndDate(tweetArr, count) # save date of first and last tweet
            filename = formatFileName(tweetArr) # format file name
            writeToJsonlFile(tweetArr, dateRange, filename) # save json as a json file
            # reset 'global' vars
            tweetArr = []
            count = 0
        try:
            fetchedTweet = api.get_status(tweet)
            
            # search for keywords in tweet text
            if containsKeyword(fetchedTweet.text, keywords):
                tweetArr.append(fetchedTweet._json)
                count += 1 # maintain counter to know when the json file reaches 1000 COVID-19+Transportation tweets
                print(f"COVID-19+Transportation Tweet #{count}\n" + fetchedTweet.text)

        except Exception as e:
            # skip tweets that cannot be fetched (e.g. deleted or protected); uncomment to debug
            # print("Exception Thrown: ", e)
            continue
    
    # save any remaining tweets if the loop ends before count reaches 1000
    if count > 0:
        dateRange = fetchStartAndEndDate(tweetArr, count) # save date of first and last tweet
        filename = formatFileName(tweetArr) # format file name
        writeToJsonlFile(tweetArr, dateRange, filename) # save json as a json file    
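
The flush-at-1000 logic in `saveTransportationTweets()` can be looked at separately from the API calls. The sketch below isolates it as a generator; `batched` is a hypothetical helper, not part of the notebook. In the notebook's terms, each yielded batch would be passed to `fetchStartAndEndDate`, `formatFileName`, and `writeToJsonlFile`.

```python
def batched(items, batchSize=1000):
    """Yield consecutive lists of at most batchSize items; the last one may be shorter."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batchSize:
            yield batch
            batch = []
    if batch:  # flush whatever is left when the input runs out
        yield batch

sizes = [len(b) for b in batched(range(2500))]
print(sizes)  # [1000, 1000, 500]
```

Writing the save step once, at the point where a batch is yielded, avoids repeating the three save calls both inside and after the loop.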

## Finally, it's time to call the Twitter API <a href name id='3'>
    
Using the functions we defined above, we are now ready to call `saveTransportationTweets()` on the COVID-19 dataset files.

In [21]:
time_range = ['2020-01', '2020-02', '2020-03', '2020-04', '2020-05', '2020-06', '2020-07', '2020-08', '2020-09', '2020-10', '2020-11', '2020-12', 
            '2021-01', '2021-02', '2021-03', '2021-04', '2021-05', '2021-06', '2021-07', '2021-08', '2021-09', '2021-10', '2021-11', '2021-12', 
            '2022-01', '2022-02', '2022-03']

covidDataset = fetchAllTextFileURLs(time_range)
covidDataset = covidDataset[502:] # data collection starts from 1/21/2020 at 22:00.
print('Total files in COVID-19 Dataset:', len(covidDataset))
# show first 3 txt files
print('The first 3 automatically generated files:', covidDataset[:3])

Total files in COVID-19 Dataset: 19586
The first 3 automatically generated files: ['https://raw.githubusercontent.com/echen102/COVID-19-TweetIDs/master/2020-01/coronavirus-tweet-id-2020-01-21-22.txt', 'https://raw.githubusercontent.com/echen102/COVID-19-TweetIDs/master/2020-01/coronavirus-tweet-id-2020-01-21-23.txt', 'https://raw.githubusercontent.com/echen102/COVID-19-TweetIDs/master/2020-01/coronavirus-tweet-id-2020-01-22-00.txt']


The following lines of code:
- select several .txt files to extract tweet IDs from
- extract the tweet IDs from each .txt file
- run `saveTransportationTweets()`, which rehydrates each Tweet object and saves the relevant tweets in .jsonl files
    - we define 'relevant' using the [keywords](#2b) defined earlier.

In [22]:
# selectedRange = covidDataset[8:15] # covidDataset[0] is 1/21/2020 at 22:00, so this covers 1/22/2020 from 06:00 to 12:00
# for file in selectedRange:
#     tweetIDs = getTweetIDsFrom(file)
#     if len(tweetIDs) > 0:
#         print('tweetIDs found')
#         saveTransportationTweets(tweetIDs, keywords)

### Dehydrate `jsonl` files back to `txt` file of tweet IDs  <a href name id='4'>

Now that we have our dataset of `COVID-19+Transportation` tweets, we can test Twarc's `dehydrate` function to compress the tweets into just tweet IDs. Twarc is also a command-line tool, which means we can run 

    twarc dehydrate covid-mobility-tweet-starting-2020-1-21.jsonl > covid-mobility-tweet-starting-2020-1-21.txt
    
from within the notebook by using the `os` library. According to the documentation, Twarc generates a text file containing the tweet IDs it finds based on the key `id_str`. The [source code](https://twarc-project.readthedocs.io/en/latest/api/client/) (for the client's `dehydrate` method) is reproduced below:
    

    def dehydrate(self, iterator):
        """
        Pass in an iterator of tweets' JSON and get back an iterator of the
        IDs of each tweet.
        """
        for line in iterator:
            try:
                yield json.loads(line)["id_str"]
            except Exception as e:
                log.error("uhoh: %s\n" % e)
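
The files written by `writeToJsonlFile` begin with a date-range header line that is not valid JSON, so any dehydrate step has to tolerate lines it cannot parse. The following is a minimal sketch of the same idea in plain Python. It is not Twarc's implementation, and `dehydrateLines` is a hypothetical name.

```python
import json

def dehydrateLines(lines):
    """Yield the id_str of every line that parses as a tweet JSON object."""
    for line in lines:
        try:
            tweet = json.loads(line)
        except json.JSONDecodeError:
            continue  # e.g. the non-JSON "{Date Range: ...}" header line
        if isinstance(tweet, dict) and 'id_str' in tweet:
            yield tweet['id_str']

sampleLines = [
    '{Date Range: 2020-5-1-0 00:11:52 to 2020-5-1 00:11:52}\n',
    json.dumps({'id_str': '1256013431855108096', 'text': 'example'}) + '\n',
]
print(list(dehydrateLines(sampleLines)))
```

Because it accepts any iterable of strings, the same function works on an open file handle, for example `with open(path) as f: ids = list(dehydrateLines(f))`.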
  

In [23]:
# Initialize Twarc; not currently used in client
twarc = Twarc()

In [26]:
# Test dehydrate on a sample .jsonl file
os.system('twarc dehydrate covid-mobility-tweet-starting-2020-1-21.jsonl > covid-mobility-tweet-starting-2020-1-21.txt')

In [25]:
def findfile(name, path):
    '''
    Returns the path of the first file called `name` found under `path`, or None if there is none
    '''
    for dirpath, dirnames, filenames in os.walk(path):
        if name in filenames:
            return os.path.join(dirpath, name)
    return None

filepath = findfile('covid-mobility-tweet-starting-2020-1-21.txt', "./")
print(filepath, '\n')

with open(filepath, 'r') as file:
    print(f"Tweet IDs in {filepath}")
    for line in file:
        print(line)

./covid-mobility-tweet-starting-2020-1-21.txt 

Tweet IDs in ./covid-mobility-tweet-starting-2020-1-21.txt


## Update: Data Collection through `general-collector.py` (4/7/2022) <a href name id='5'>
    
To collect data on the remote CS server, I reformatted my Jupyter Notebook into a Python script. I've been using the `nohup` command on Linux to constantly collect data from the COVID-19 dataset. Because of the difference between the notebook version and the script, I have had to make a few changes to this notebook, particularly in relation to the naming conventions. I've also uploaded my data collection script (`dataCollection.py`) in my `finalproject` folder. 

When creating new files:
1. Follow conventions to generate python file `collect-MMDDYYYY-MMDDYYYY.py`
2. Create folder with the same format `collect-MMDDYYYY-MMDDYYYY`
3. Copy an older Python file and change the following:
    1. Ctrl-f `collect-`. Change the two log files and the folder path to the proper name
    2. Ctrl-f `.log"`. Check that the two log files have been changed
    3. Ctrl-f `selectedRange`. Change the selectedRange to the date range we're collecting from

Date Ranges currently being collected from:
- 3/11/2020 - 3/27/2020
- 3/28/2020 - 5/1/2020
- 5/2/2020 - 5/8/2020
- 7/23/2020 - 8/30/2020

- 11/24/2020 - 11/31/2020
- 12/11/2020 - 12/25/2020

In [10]:
# Call the general collector script
import os

def startGeneralCollector(collect_dateRange, startIndex, endIndex):
    scriptCommand = '../miniconda3/bin/python3 collection-scripts/general-collector.py'
    os.system(f'{scriptCommand} --collect_dateRange {collect_dateRange} --startIndex {startIndex} --endIndex {endIndex}')
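
`os.system` passes a single string through the shell and only returns an exit status. A minimal sketch of the same call using `subprocess.run` with an argument list is shown below. The helper names `buildCollectorCommand` and `runGeneralCollector` are hypothetical; the interpreter path, script path, and flags are copied from the cell above.

```python
import subprocess

def buildCollectorCommand(collect_dateRange, startIndex, endIndex,
                          python='../miniconda3/bin/python3',
                          script='collection-scripts/general-collector.py'):
    """Return the collector command as an argument list (no shell parsing involved)."""
    return [
        python, script,
        '--collect_dateRange', str(collect_dateRange),
        '--startIndex', str(startIndex),
        '--endIndex', str(endIndex),
    ]

def runGeneralCollector(collect_dateRange, startIndex, endIndex):
    """Run the collector and raise CalledProcessError if it exits with an error."""
    command = buildCollectorCommand(collect_dateRange, startIndex, endIndex)
    return subprocess.run(command, check=True)

print(buildCollectorCommand('collect-06072021-06132021', 12290, 12458))
```

With `check=True`, a failed run raises an exception instead of passing silently, which may make mistakes in a date range or index easier to notice.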

In [1]:
# startGeneralCollector('collect-06072021-06132021', 12290, 12458)

I used [this spreadsheet](https://docs.google.com/spreadsheets/d/1hKOxDP_hTEeD7rBBV-W_REfw2Dt1pYwiJwfYwSy5u8Q/edit?usp=sharing) to keep track of the data collection periods and ensure I did not double collect data. There was no time to waste on silly mistakes (although I sure did make a few...)! The `general-collector` script was very helpful in streamlining the data collection process. The abstraction + spreadsheet combined allowed me to work 10x faster!

## Other tests [can ignore]

As mentioned previously, our script creates "URLs" between 1/2020 and the current month (4/2022) for days 1 to 31 and hours 0 to 23. We did this to simplify URL generation when accessing the COVID-19 Dataset GitHub repo. Because not all months have 31 days, and not all hours have collected data, an additional check ensures we ignore links that do not exist.

In [15]:
# Get an idea of the month to month breakdown

# Data collection for the month of January 2020 ran from 1/21/2020 to 1/31/2020,
# i.e. index 502 to index 743 of the full URL list (index 0 to 241 of covidDataset)
print("January 2020", covidDataset[0].split('-')[6:], covidDataset[241].split('-')[6:])

# We iterate over all 24 hours and all '31' days (regardless of month)
# To find an entire month of 2020 data, we take covidDataset[744*(X-2)+242 : 744*(X-1)+242] where X = month #
print("February 2020", covidDataset[242].split('-')[6:], covidDataset[985].split('-')[6:])

print("March 2020", covidDataset[986].split('-')[6:], covidDataset[1729].split('-')[6:])

print("April 2020", covidDataset[1730].split('-')[6:], covidDataset[2473].split('-')[6:])

January 2020 ['2020', '01', '21', '22.txt'] ['2020', '01', '31', '23.txt']
February 2020 ['2020', '02', '01', '00.txt'] ['2020', '02', '31', '23.txt']
March 2020 ['2020', '03', '01', '00.txt'] ['2020', '03', '31', '23.txt']
April 2020 ['2020', '04', '01', '00.txt'] ['2020', '04', '31', '23.txt']
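
The index arithmetic above can be sketched as a small helper that maps a date and hour to its position in `covidDataset`. `datasetIndex` is a hypothetical name; it assumes the URL list starts at January 2020, has 31 * 24 = 744 entries per month, and has had its first 502 entries sliced off, as in this notebook.

```python
def datasetIndex(year, month, day, hour, firstIndex=502):
    """
    Index into covidDataset for a given date and hour, assuming 744 generated
    URLs per month starting at 2020-01 and a list sliced at `firstIndex`.
    """
    monthsSinceStart = (year - 2020) * 12 + (month - 1)
    return monthsSinceStart * 744 + (day - 1) * 24 + hour - firstIndex

print(datasetIndex(2020, 1, 21, 22))   # first file in covidDataset
print(datasetIndex(2020, 2, 1, 0))     # start of February 2020
print(datasetIndex(2020, 12, 11, 0))
```

Since slice ends are exclusive, the `endIndex` for a collection run would be the index of the last wanted hour plus one.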


In [28]:
covidDataset[78]

'https://raw.githubusercontent.com/echen102/COVID-19-TweetIDs/master/2020-01/coronavirus-tweet-id-2020-01-25-04.txt'

In [69]:
print(len(covidDataset))
print(covidDataset[7922])
# slice ends are exclusive, so the second value (8282) is the one to pass as endIndex
print(covidDataset[8281:8282])

19586
https://raw.githubusercontent.com/echen102/COVID-19-TweetIDs/master/2020-12/coronavirus-tweet-id-2020-12-11-00.txt
['https://raw.githubusercontent.com/echen102/COVID-19-TweetIDs/master/2020-12/coronavirus-tweet-id-2020-12-25-23.txt']
