# COVID-19+Transportation Dataset Creation

In this notebook, we generate the public files and folders for the `COVID-19-Transportation-TweetIDs` GitHub repo and use `twarc` to dehydrate our `.jsonl` files into `.txt` files of tweet IDs, which we then place into the public folders. We aim to follow the same conventions as Chen, Lerman, and Ferrara's COVID-19 Pandemic Twitter Data Set. Once the files are in place, the local repository is ready to be pushed to GitHub. The public dataset can be found here: [COVID-19-Transportation-TweetIDs](https://github.com/jennyw23/COVID-19-Transportation-TweetIDs).

Because I have already prepared my files in a way that simplifies the "condensing files for the dataset" step, this notebook is short. I've still included a brief Table of Contents in case it is helpful.

## Table of Contents
1. [Create directories to store the files by month](#1)
2. [Dehydrate `.jsonl` files into the directories for the public dataset](#2)
3. [The End :)](#3)

In [1]:
import os
import sys
sys.path.append(".")
import helper.DataRetrieval as dr

In [2]:
# List entire tweet time range
time_range = ['2020-01', '2020-02', '2020-03', '2020-04', '2020-05', '2020-06', '2020-07', '2020-08', '2020-09', '2020-10', '2020-11', '2020-12', 
            '2021-01', '2021-02', '2021-03', '2021-04', '2021-05', '2021-06', '2021-07', '2021-08', '2021-09', '2021-10', '2021-11', '2021-12', 
            '2022-01', '2022-02', '2022-03', '2022-04']
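
The month list above is typed out by hand. As a minimal sketch, the same list could be generated from its two endpoints; the helper name `month_range` and the variable `generated_range` are hypothetical and not part of the original notebook.

```python
def month_range(start, end):
    """Return every 'YYYY-MM' string from start to end, inclusive."""
    start_year, start_month = map(int, start.split('-'))
    end_year, end_month = map(int, end.split('-'))

    months = []
    year, month = start_year, start_month
    while (year, month) <= (end_year, end_month):
        months.append(f'{year}-{month:02d}')
        # after December, roll over to January of the next year
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return months

# hypothetical name, chosen so it does not overwrite the hand-written time_range
generated_range = month_range('2020-01', '2022-04')
```

Generating the list this way avoids typos when the collection window is extended by a few months.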

## Create directories to store the files by month <a id='1'></a>

In [6]:
def generatePublicMonthDirectories():
    time_range = ['2020-01', '2020-02', '2020-03', '2020-04', '2020-05', '2020-06', '2020-07', '2020-08', '2020-09', '2020-10', '2020-11', '2020-12', 
            '2021-01', '2021-02', '2021-03', '2021-04', '2021-05', '2021-06', '2021-07', '2021-08', '2021-09', '2021-10', '2021-11', '2021-12', 
            '2022-01', '2022-02', '2022-03', '2022-04']

    publicPath = '/students/jw10/cs315/COVID-19-Transportation-TweetIDs'
    for month in time_range:
        monthDir = f'{publicPath}/{month}'
        # monthDir already contains publicPath; exist_ok=True skips directories that already exist
        os.makedirs(monthDir, exist_ok=True)
            
def getYearMonth(filename):
    components = filename.split('-')
    YYYY, MM = components[3], components[4]
    
    MM = MM.zfill(2) # pad single-digit months with a leading 0; already padded values are left unchanged
    
    return f'{YYYY}-{MM}'

def standardizeFileName(filename, extension):
    components = filename.split('-')
    MM, dd = components[4], components[5].split('.')[0]
    MM = MM.zfill(2) # pad single-digit months with a leading 0; already padded values are left unchanged
    dd = dd.zfill(2) # pad single-digit days with a leading 0; already padded values are left unchanged
    root = "-".join(components[:4])

    return f'{root}-{MM}-{dd}.{extension}'

def dehydrateToPublicDir(basePath, fileName):
    publicDirBasePath = '/students/jw10/cs315/COVID-19-Transportation-TweetIDs'
    # get subdirectory of public dir
    yearMonth = getYearMonth(fileName)
    # standardize two digits for txt file name
    txtFileName = standardizeFileName(fileName, 'txt')
    # dehydrate tweets and write the IDs into the month's subdirectory of the public dir
    twarcCommand = f'twarc dehydrate {basePath}/{fileName} > {publicDirBasePath}/{yearMonth}/{txtFileName}'
    
    os.system(twarcCommand)
    
# generatePublicMonthDirectories()
# standardizeFileName('covid-mobility-tweet-2021-5-1.jsonl', 'txt')
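
The helpers above split the file name on `-` and pad the month and day by hand. As an alternative sketch, the date can be parsed with `datetime.strptime`, which accepts both padded and unpadded values. This assumes every file name ends in `-<year>-<month>-<day>.<extension>`, as in the example `covid-mobility-tweet-2021-5-1.jsonl`; the names `DATE_PATTERN`, `parse_day_file`, `standardized_name`, and `year_month` are hypothetical.

```python
import re
from datetime import datetime

# Assumed naming pattern: <prefix>-<year>-<month>-<day>.<extension>
DATE_PATTERN = re.compile(r'^(?P<prefix>.+)-(?P<date>\d{4}-\d{1,2}-\d{1,2})\.\w+$')

def parse_day_file(filename):
    """Split a daily file name into its prefix and its date."""
    match = DATE_PATTERN.match(filename)
    if match is None:
        raise ValueError(f'unexpected file name: {filename}')
    # strptime accepts months and days with or without a leading zero
    day = datetime.strptime(match.group('date'), '%Y-%m-%d')
    return match.group('prefix'), day

def standardized_name(filename, extension):
    """Return the file name with a zero-padded date and a new extension."""
    prefix, day = parse_day_file(filename)
    return f"{prefix}-{day.strftime('%Y-%m-%d')}.{extension}"

def year_month(filename):
    """Return the 'YYYY-MM' folder name for a daily file."""
    _, day = parse_day_file(filename)
    return day.strftime('%Y-%m')
```

Parsing the date also rejects impossible values such as month 13, which the string-splitting version would pass through silently.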

## Dehydrate `.jsonl` files into the directories for the public dataset <a id='2'></a>

In [5]:
tweetsByDayPath = '/students/jw10/cs315/tweets-by-day'
publicdir = '/students/jw10/cs315/COVID-19-Transportation-TweetIDs'

for dayFile in os.listdir(tweetsByDayPath):
    # only dehydrate the daily tweet files; skip anything else in the directory
    if dayFile.endswith('.jsonl'):
        dehydrateToPublicDir(tweetsByDayPath, dayFile)
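
To make the dehydration step concrete, here is a rough pure-Python sketch of what it produces: one tweet ID per line. It is not how the notebook does it (the notebook shells out to `twarc dehydrate`), and it assumes each non-empty line of a `.jsonl` file is a single tweet object with an `id_str` field, falling back to `id` if that field is missing. The function name `dehydrate_file` is hypothetical.

```python
import json

def dehydrate_file(jsonl_path, txt_path):
    """Write one tweet ID per line to txt_path and return how many were written."""
    count = 0
    with open(jsonl_path, encoding='utf-8') as source, \
         open(txt_path, 'w', encoding='utf-8') as target:
        for line in source:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            tweet = json.loads(line)
            # assumption: tweets carry 'id_str'; otherwise fall back to 'id'
            tweet_id = tweet.get('id_str') or tweet.get('id')
            target.write(f'{tweet_id}\n')
            count += 1
    return count
```

Returning the count makes it easy to compare the number of IDs written against the number of tweets collected for that day.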

## The End :) <a id='3'></a>
    
Now we have our COVID-19+Transportation tweet IDs stored in readable, month-by-month directories with consistent file names. We're now ready to upload them to a GitHub repo!