# PM2 Data Exploration

### Jenny Wang
Due Date: April 8, 2022

## Overview
This notebook begins to look at the raw data we've collected using our specific keywords. It goes through the collection tweets and also creates functions that combine tweet files from the same date. The most important part of the notebook is the section with "Functions for Data Analysis". These functions are tested in the notebook, then wrapped into a Class called `FileSynthesizer` which we've put in a helper module called `DataRetrieval`. `FileSynthesizer` is used in the Data Cleaning notebook to simply the process of aggregating tweets (since we need to update the files frequently while data is being collected).

### Instructions:
You should create two separate notebooks: 
1. for collecting data --> The data collection notebook shows how you used APIs or web scraping to get the data (or part of it, if you will rely on a bigger dataset from someone else)
2. **for exploring data --> Your data exploration notebook should have code that generates various statistics: how many data points, what does each datapoint look like, what kind of features it has, what are some things you can say about these features, is there any cleaning that you need to do, etc.**

If the data allows it, you can try to create visualizations using one of the visualization libraries that were introduced in WT1 (Matplotlib, Plotly, or Seaborn). These visualizations can be used in the paper, if they reveal interesting things about the data.


### Table of Contents
1. [Explore COVID-19 Dataset](#1)
2. [COVID-19+Transportation Data Collected](#0.1)
3. [COVID-19+Transportation Data Exploration](#2)
4. [Functions for Data Analysis](#3)
    1. [Observations](#3.1)
5. [COVID-19+Transportation Data Analysis](#4)
     1. [Use NLTK to calculate frequency distributions](#2.1)
     2. [Simple Topic Modeling and Visualization](#2.2)

## Explore COVID-19 Dataset <a href name id="1">

During our data collection, we found that the COVID-19 Github repo contains 28 folders (one folder for each month between January 2020 - April 2022). Each folder contains text files from each hour in the day, from the first day of the month to the last day of the month. Thus, each folder contains at most 24 * 31 = 744 text files. Thus, the goal is to eventually search through roughly 20832 text files containing thousands of tweets.

As of v2.93, the total number of tweets in the dataset is 2,422,918,491.

I copy in a few functions from the data collection to generate statistics below:

In [1]:
import os
import json
import requests
from datetime import date, timedelta

In [2]:
# Relevant functions from data collection

# get outer directories
base_url = 'https://github.com/echen102/COVID-19-TweetIDs/tree/master/'
time_range = ['2020-01', '2020-02', '2020-03', '2020-04', '2020-05', '2020-06', '2020-07', '2020-08', '2020-09', '2020-10', '2020-11', '2020-12', 
            '2021-01', '2021-02', '2021-03', '2021-04', '2021-05', '2021-06', '2021-07', '2021-08', '2021-09', '2021-10', '2021-11', '2021-12', 
            '2022-01', '2022-02', '2022-03', '2022-04']

def fetchAllTextFileURLs(time_range):
    files = []
    base_url = 'https://raw.githubusercontent.com/echen102/COVID-19-TweetIDs/master/'
    prefix = 'coronavirus-tweet-id'
    dates = list(range(1, 32)) # day 1 to 31
    hours = list(range(24)) # hour 0 to 23
    
    for yyyyMM in time_range:
        for date in dates:
            date = f'0{date}' if date < 10 else f'{date}'
            for hour in hours:
                hour = f'0{hour}' if hour < 10 else f'{hour}'
                fileURL = f'{base_url}{yyyyMM}/{prefix}-{yyyyMM}-{date}-{hour}.txt'
                files.append(fileURL)

                # print(fileURL)
    
    return files

def getTweetIDsFrom(txtFile):
    '''
    Retrieves tweet IDs from the text file's URL and stores in list
    '''
    tweetList = []
    response = requests.get(txtFile) # requests raw file from Github
    
    if response.status_code == 200:
        content = response.content.decode('utf-8')
        tweetList = content.split()
        
    return tweetList

In [3]:
covidDataset = fetchAllTextFileURLs(time_range)

print('Total files in selected time range:', len(covidDataset))
print('Note: not all urls generated are valid. Some text files are missing or represent imaginary dates e.g. February 31)')

Total files in selected time range: 20832
Note: not all urls generated are valid. Some text files are missing or represent imaginary dates e.g. February 31)


### Size of text files
To avoid requesting over 20,000 URLs and searching their lengths (which would take a very, very long time...), I run a random query of 10 text files. I show the number of tweets per text file below.

In [4]:
import random
from random import randint

random.seed(4)

for _ in range(10):
    end = len(covidDataset)
    i = randint(0, end)
    tweets = getTweetIDsFrom(covidDataset[i])
    
    print(f'{len(tweets)} tweets collected from text file {covidDataset[i][97:len(covidDataset[i])]}')

164560 tweets collected from text file 2020-11-13-06.txt
143402 tweets collected from text file 2021-02-12-02.txt
48043 tweets collected from text file 2020-05-17-20.txt
149146 tweets collected from text file 2021-06-14-18.txt
96641 tweets collected from text file 2021-10-03-19.txt
180103 tweets collected from text file 2020-07-26-14.txt
0 tweets collected from text file 2020-04-31-00.txt
37342 tweets collected from text file 2020-03-29-19.txt
36691 tweets collected from text file 2020-01-28-01.txt
114090 tweets collected from text file 2021-06-22-07.txt


The largest file from our random search of 10 tweets contains **180,103 tweets**. It was collected on July 26, 2020, when we were in the midst of the pandemic's first wave. Note that the file which printed 0 tweets was from `2020-04-31-00`, a date that does not exist.

## COVID-19+Transportation Data Collected <a href name id="0.1">
    
The COVID-19+Transportation Dataset collects data with the following steps:
- open a COVID-19 tweet file
- run through the tweet IDs within the file
    - rehydrate the tweet
    - search for transportation keywords based on a predetermined list
    - if a transportation keyword is found, add it to the list. Otherwise, move on to the next tweet
    - when either (1) the file end is reached or (2) the list of COVID-19+Transportation tweets exceeds 1000, we save the tweets in a `.jsonl` file.
    
As of 4/9/2022, I have set up `nohup` to run on several threads in order to collect data from various date ranges. I have obtained relevant data from 1/22/2020 - 1/25/2020. The scripts are currently collecting from the following 6 date ranges:
- 3/11/2020 - 3/27/2020
- 3/28/2020 - 5/1/2020
- 5/2/2020 - 5/8/2020
- 7/23/2020 - 8/30/2020
- 11/24/2020 - 11/31/2020
- 12/11/2020 - 12/25/2020
    
I will continue to increase the number of date ranges in the coming weeks. 

#### File Breakdown
I have created a data collection script for each time range, with the naming convention of `collect-ddMMyyyy-ddMMyyyy.py` where the first `ddMMyyyy` represents the start date at hour 0, and the last `ddMMyyyy` represents the end date at hour 23. For example, `collect-11242020-11312020.py` collects data from November 24, 2020 at 00:00 until November 31, 2020 at 23:00. 

The tweets are stored in a directory of the same name, `collect-ddMMyyyy-ddMMyyyy`. A log file called `collect-ddMMyyyy-ddMMyyyy.log` is also created. Each tweet file follows the naming convention `covid-mobility-tweet-starting-yyyy-MM-dd~HH:mm:ss.json`

#### Count the number of tweets collected

In [5]:
def countNumberOfTweets(filePath):
    '''
    Counts the number of tweets in a given json file.
    '''
    with open(filePath, 'r') as file:
        result = []
        jsonList = list(file)
    
    return len(jsonList) - 1 # subtracts the first line

The following lines of code print out the number of lines in each file.

In [6]:
# get directories

path = os.getcwd()+'/collection-tweets'
print(path)
print(os.listdir(path))

# directories = list(filter(os.path.isdir, os.listdir(path))) # old directories before I moved the folders
directories = list(os.listdir(path))
# print(directories)
print("Tweet Counts")
total = 0
for fileDir in directories:
    if fileDir[0] != '.':
        for file in os.listdir(f'{path}/{fileDir}'):
            filePath = f'{path}/{fileDir}/{file}'
            with open(filePath) as f:
                tweetCount = countNumberOfTweets(filePath)
                total += tweetCount if tweetCount < 1000 else 0 # each file should max out at 1000 tweets. If there are more, the data format is faulty -- to be fixed
#                 print(f'File: {file}: {tweetCount} \t', end="")
#                 print(f'{tweetCount} \t', end="")
    
print("\nTotal: ", total)

/students/jw10/cs315/collection-tweets
['collect-01132022-01162022', 'collect-12272022-01022022', 'collect-05142020-05172020', 'collect-12202021-12262021', 'collect-10212021-10242021', 'collect-08022020-08052020', 'collect-01102022-01162022', 'collect-01182021-01242021', 'collect-11152021-11212021', 'collect-03152021-03212021', 'collect-04192021-04252021', 'collect-04062020-04122020', 'collect-11052020-11082020', 'collect-02172022-02202022', 'collect-10192020-10212020', 'collect-07302020-08012020', 'collect-01222020-01252020', 'collect-11222021-11282021', 'collect-03172022-03202022', 'collect-11252021-11282021', 'collect-09022021-09052021', 'collect-09072020-09092020', 'collect-03142022-03202022', 'collect-07082021-07112021', 'collect-06012020-06032020', 'collect-07232020-08302020', 'collect-06072021-06132021', 'collect-03312022-04032022', 'collect-12282020-01032021', 'collect-09242020-09272020', 'collect-07262021-08012021', 'collect-06182020-06212020', 'collect-10082020-10112020', 'co

As you can see, most files contain between 3 and 40 Covid-19+Transportation tweets per hour. However, two files have above 1000 lines of code (9920 and 4401, apparently). Upon further inspection, this is caused by the way the tweets appear in the editor, with  `\n` characters separating each curly brace `{`. For now, I will remove these files from the analysis while I figure out how to remove the redundant newline characters.

In [7]:
def saveDirectoryFileNames(directory):
    '''
    Retrieves all files from the given directory.
    '''
    # get directories
    files = list(os.listdir(directory))

    return files

testDir = '/students/jw10/cs315/collection-tweets/collect-01182021-01242021'
filePaths = saveDirectoryFileNames(testDir)

print("Number of hydrated tweet files generated from single directory:", len(filePaths))

Number of hydrated tweet files generated from single directory: 57


In [8]:
# Changes file extensions from .json to .jsonl for relevant directories.
import os, sys

path = '/students/jw10/cs315/collection-tweets/collect-12112020-12252020'

def changeFileExtension(fullPath):
    '''
    Given an absolute path, changes all files with the .json extension to .jsonl. This function can be reused 
    for other extension names. 
    
    https://stackoverflow.com/questions/16736080/change-the-file-extension-for-files-in-a-folder
    '''
    for filename in os.listdir(path):
        base_file, ext = os.path.splitext(filename)
        print(base_file)
        if ext == ".json":
            os.rename(f"{path}/{filename}", f"{path}/{base_file}" + ".jsonl")

## COVID-19+Transportation Data Exploration <a href name id="2">

Currently, the data collection script is still undergoing minor bug fixes. So far, I have collected data from select dates throughout the pandemic. The iteration process is quite slow. I will structure this exploration file so that it is still compatible with the new data I hope to collect after submitting this task for PM2.
    
We use the functions written above to conduct some initial analysis.

In [9]:
# gets files from a single directory
basePath = '/students/jw10/cs315/collection-tweets/'
testDir = '/students/jw10/cs315/collection-tweets/collect-01182021-01242021'
directory = 'collect-01182021-01242021'
filePaths = saveDirectoryFileNames(testDir)

print("Number of hydrated tweet files generated:", len(filePaths))

Number of hydrated tweet files generated: 57


In [10]:
def openHydratedTweetFile(basePath, fileDir, fileName):
    '''
    Opens a hydrated tweet file from a directory
    '''
    
    with open(f'{basePath}{fileDir}/{fileName}', 'r') as file:
        fileText = file.read()
        arr = fileText.split('\n')
        strippedArr = arr[1:len(arr)-1]
         
        return strippedArr

In [11]:
for filePath in filePaths:
    openHydratedTweetFile(basePath, directory, filePath)
        
#         print(len(arr[1:len(arr)-1]))
#         jsonItem = json.loads(fileText)
#         print(fileText)
#         print(jsonItem)
#         jsonList = list(file)
#         for i, item in enumerate(jsonList[1:]):
#             jsonItem = json.loads(item)
#             print(f"Tweet {i}:",jsonItem['text'])

#### Combine files from the same date

The next few functions will be used when I begin merging the hourly files. Since we only have tweets from a few dates, I will work on these functions further when they are of more use. For now, it is enough to look at the tweets file by file.

## Functions for Data Analysis <a href name id="3">
#### [Skip to COVID-19+Transportation Data Analysis](#4)

In [12]:
def generateDateRange(start_date, end_date):
    '''
    Given a start and end date, returns a generator object of 
    https://stackoverflow.com/questions/1060279/iterating-through-a-range-of-dates-in-python
    '''
    for n in range(int((end_date - start_date).days)):
        yield start_date + timedelta(n)
        
# print(getDateRange('2020-07-23','2020-07-29' ))

def getDirectoryDates(fileDir):
    '''
    Stores the date range of tweets in the given directory as a list, 
    based off the directory's naming conventions `collect-MMddyyyy-MMddyyyy`
    '''
    dates = []

    dateRangeStr = fileDir.split('-')[1:3] # extracts date range
    sMonth, sDay, sYear = int(dateRangeStr[0][0:2]), int(dateRangeStr[0][2:4]), int(dateRangeStr[0][4:8])
    eMonth, eDay, eYear = int(dateRangeStr[1][0:2]), int(dateRangeStr[1][2:4]), int(dateRangeStr[1][4:8])
#     print(f'{sYear}/{sMonth}/{sDay} - {eYear}/{eMonth}/{eDay}')
    startDate, endDate = date(sYear, sMonth, sDay), date(eYear, eMonth, eDay)
    
    for singleDate in generateDateRange(startDate, endDate):
        dates.append(str(singleDate))
    
    return dates

def groupFilesFromSameDate(basePath, fileDir):
    '''
    Given a directory, groups files from the same date using a dictionary where the keys=dates and 
    values=file names.
    
    This function assumes that 'covid-mobility-tweet-starting-yyyy-MM-dd~HH:ss:mm.json' 
    represents the proper naming convention for each file. I expect that each file directory only contains 
    tweets within the range specified by its naming convention `collect-MMddyyyy-MMddyyyy`
    
    Parameters:
    fileDir = path to the directory housing tweets from date range. Path likely follows the format:
    '/students/jw10/cs315/collection-tweets/collect-MMddyyyy-MMddyyyy'
   
    '''
    dateDict = {}
    
    for file in os.listdir(f'{basePath}{fileDir}'):
        date = file[30:30+9].strip('-') # very unsafe but assumes naming conventions remain consistent
        date = date.strip('~')
        if date not in dateDict:
            dateDict[date] = [file]
        else:
            dateDict[date].append(file)
    
    return dateDict

def createLargeFileFromSameDateFiles(basePath, fileDir, writeToPath):
    '''
    Combines the hourly files into a large file for a given date containing the tweets from hours 0 - 23.
       GET TO THIS::: https://galea.medium.com/how-to-love-jsonl-using-json-line-format-in-your-workflow-b6884f65175b

    Parameters:
    basePath - entire path to ./.../.../collection-tweets folder
    fileDir - 'collect-MMddyyyy-MMddyyyy' inside the collection-tweets folder.
    '''

    largeFileTweets = []
    dateDict = groupFilesFromSameDate(basePath, fileDir)
    
    for date in dateDict:
        for file in dateDict[date]:
            tweetArr = openHydratedTweetFile(basePath, fileDir, file)
            largeFileTweets.extend(tweetArr)
        
        print("creating file:", f"{writeToPath}/covid-mobility-tweet-{date}.jsonl")
        with open(f"{writeToPath}/covid-mobility-tweet-{date}.jsonl", 'w') as outfile:
            for entry in largeFileTweets:
                outfile.write(entry)
                outfile.write('\n')
                
        largeFileTweets = [] # reset for next date
        
def getAllTweets(path, writeToPath):
    '''
    Generates large tweet file from daily tweets folder
    path=/students/jw10/cs315/tweets-by-day
    '''
    largeFileTweets = []
    for file in os.listdir(path):
        print("File: ", file)
        tweetArr = openHydratedTweetFile(path, "", file)
        largeFileTweets.extend(tweetArr)
        
    print("creating file:", f"{writeToPath}/covid-mobility-tweet-all.jsonl")
    with open(f"{writeToPath}/covid-mobility-tweet-all.jsonl", 'w') as outfile:
        for entry in largeFileTweets:
            outfile.write(entry)
            outfile.write('\n')
# dates = getDirectoryDates('collect-07232020-08302020')
# print(dates)


In [13]:
basePath='/students/jw10/cs315/tweets-by-day'
writeToPath = '/students/jw10/cs315/all-tweets'

getAllTweets(basePath, writeToPath)

File:  covid-mobility-tweet-2020-6-1.jsonl
File:  covid-mobility-tweet-2020-12-3.jsonl
File:  covid-mobility-tweet-2021-2-22.jsonl
File:  covid-mobility-tweet-2020-4-10.jsonl
File:  covid-mobility-tweet-2020-11-6.jsonl
File:  covid-mobility-tweet-2021-10-2.jsonl
File:  covid-mobility-tweet-2020-6-2.jsonl
File:  covid-mobility-tweet-2020-5-2.jsonl
File:  covid-mobility-tweet-2022-3-14.jsonl
File:  covid-mobility-tweet-2021-9-3.jsonl
File:  covid-mobility-tweet-2021-12-1.jsonl
File:  covid-mobility-tweet-2021-8-9.jsonl
File:  covid-mobility-tweet-2021-1-19.jsonl
File:  covid-mobility-tweet-2020-11-2.jsonl
File:  covid-mobility-tweet-2020-3-30.jsonl
File:  covid-mobility-tweet-2020-5-16.jsonl
File:  covid-mobility-tweet-2020-1-22.jsonl
File:  covid-mobility-tweet-2022-3-30.jsonl
File:  covid-mobility-tweet-2021-10-1.jsonl
File:  covid-mobility-tweet-2021-7-8.jsonl
File:  covid-mobility-tweet-2020-3-28.jsonl
File:  covid-mobility-tweet-2020-3-26.jsonl
File:  covid-mobility-tweet-2022-2-4.j

In [14]:
# Tests of the above function
basePath = '/students/jw10/cs315/collection-tweets/'
fileDir = 'collect-05022020-05082020'

writeToPath = '/students/jw10/cs315/tweets-by-day'
testArr = createLargeFileFromSameDateFiles(basePath, fileDir, writeToPath)

# basePath = '/students/jw10/cs315/collection-tweets/'
# fileDir = 'collect-01222020-01252020'

# writeToPath = '/students/jw10/cs315/tweets-by-day'
# testArr = createLargeFileFromSameDateFiles(basePath, fileDir, writeToPath)


creating file: /students/jw10/cs315/tweets-by-day/covid-mobility-tweet-2020-5-7.jsonl
creating file: /students/jw10/cs315/tweets-by-day/covid-mobility-tweet-2020-5-3.jsonl
creating file: /students/jw10/cs315/tweets-by-day/covid-mobility-tweet-2020-5-4.jsonl
creating file: /students/jw10/cs315/tweets-by-day/covid-mobility-tweet-2020-5-2.jsonl
creating file: /students/jw10/cs315/tweets-by-day/covid-mobility-tweet-2020-5-6.jsonl
creating file: /students/jw10/cs315/tweets-by-day/covid-mobility-tweet-2020-5-8.jsonl
creating file: /students/jw10/cs315/tweets-by-day/covid-mobility-tweet-2020-5-5.jsonl


In [15]:
# Function to combine all tweet collectors into a single day's worth of files
def combineHourlyTweetsToDaily(basePath, writeToPath):
    '''
    Takes all folders within the /collection-tweets directory and adds all hourly files of the same date to a single
    folder. 
    
    Parameters:
    basePath - use this basePath: basePath = '/students/jw10/cs315/collection-tweets/' but use 
        basePath = '/students/jw10/cs315/local-collection-tweets/' after the desktop is done collecting data
    writeToPath - writeToPath = '/students/jw10/cs315/tweets-by-day'

    '''
    fileDirs = os.listdir(basePath)

    for fileDir in fileDirs:
        createLargeFileFromSameDateFiles(basePath, fileDir, writeToPath)
        
combineHourlyTweetsToDaily('/students/jw10/cs315/collection-tweets/', '/students/jw10/cs315/tweets-by-day')

creating file: /students/jw10/cs315/tweets-by-day/covid-mobility-tweet-2022-1-14.jsonl
creating file: /students/jw10/cs315/tweets-by-day/covid-mobility-tweet-2022-1-13.jsonl
creating file: /students/jw10/cs315/tweets-by-day/covid-mobility-tweet-2021-12-2.jsonl
creating file: /students/jw10/cs315/tweets-by-day/covid-mobility-tweet-2020-5-14.jsonl
creating file: /students/jw10/cs315/tweets-by-day/covid-mobility-tweet-2020-5-15.jsonl
creating file: /students/jw10/cs315/tweets-by-day/covid-mobility-tweet-2020-5-16.jsonl
creating file: /students/jw10/cs315/tweets-by-day/covid-mobility-tweet-2021-12-2.jsonl
creating file: /students/jw10/cs315/tweets-by-day/covid-mobility-tweet-2021-10-2.jsonl
creating file: /students/jw10/cs315/tweets-by-day/covid-mobility-tweet-2020-8-3.jsonl
creating file: /students/jw10/cs315/tweets-by-day/covid-mobility-tweet-2020-8-2.jsonl
creating file: /students/jw10/cs315/tweets-by-day/covid-mobility-tweet-2022-1-10.jsonl
creating file: /students/jw10/cs315/tweets-by

creating file: /students/jw10/cs315/tweets-by-day/covid-mobility-tweet-2020-10-2.jsonl
creating file: /students/jw10/cs315/tweets-by-day/covid-mobility-tweet-2021-8-31.jsonl
creating file: /students/jw10/cs315/tweets-by-day/covid-mobility-tweet-2021-8-30.jsonl
creating file: /students/jw10/cs315/tweets-by-day/covid-mobility-tweet-2021-2-24.jsonl
creating file: /students/jw10/cs315/tweets-by-day/covid-mobility-tweet-2021-2-22.jsonl
creating file: /students/jw10/cs315/tweets-by-day/covid-mobility-tweet-2021-2-23.jsonl
creating file: /students/jw10/cs315/tweets-by-day/covid-mobility-tweet-2021-11-5.jsonl
creating file: /students/jw10/cs315/tweets-by-day/covid-mobility-tweet-2021-11-4.jsonl
creating file: /students/jw10/cs315/tweets-by-day/covid-mobility-tweet-2021-4-6.jsonl
creating file: /students/jw10/cs315/tweets-by-day/covid-mobility-tweet-2021-4-5.jsonl
creating file: /students/jw10/cs315/tweets-by-day/covid-mobility-tweet-2021-4-7.jsonl
creating file: /students/jw10/cs315/tweets-by-

creating file: /students/jw10/cs315/tweets-by-day/covid-mobility-tweet-2020-4-5.jsonl
creating file: /students/jw10/cs315/tweets-by-day/covid-mobility-tweet-2020-3-31.jsonl
creating file: /students/jw10/cs315/tweets-by-day/covid-mobility-tweet-2020-4-4.jsonl
creating file: /students/jw10/cs315/tweets-by-day/covid-mobility-tweet-2020-3-30.jsonl
creating file: /students/jw10/cs315/tweets-by-day/covid-mobility-tweet-2020-4-7.jsonl
creating file: /students/jw10/cs315/tweets-by-day/covid-mobility-tweet-2020-4-1.jsonl
creating file: /students/jw10/cs315/tweets-by-day/covid-mobility-tweet-2020-3-29.jsonl
creating file: /students/jw10/cs315/tweets-by-day/covid-mobility-tweet-2021-5-24.jsonl
creating file: /students/jw10/cs315/tweets-by-day/covid-mobility-tweet-2021-5-25.jsonl
creating file: /students/jw10/cs315/tweets-by-day/covid-mobility-tweet-2021-5-26.jsonl


### Observations <a href name id="3.1">
It appears that many of these tweets are truncated. I realized that the Tweepy API's default does not include "extended text", so I have to go back and rehydrate these tweets using `api.get_status(tweet, tweet_mode='extended')`. For the next few scripts that I run, I will fix this internally, but for the tweets that have already been generated, I will re-retrieve the tweets by tweet ID in this notebook before running analysis.

I also noticed that there are quite a few repeat retweets appearing. Because retweets often indicate agreement or further discourse, I will count them as their own tweet entities.

Another issue is that our transportation keywords captured more than just mobility/transportation tweets. The word `line` is found quite a bit, but rarely relates to transportation. For example, `Guess this is the new party line cause it sure ain't this thing called "truth." https://t.co/rkB5Gwzal4` uses the word "line" in relation to political party lines. My exploration into other files also shows that `line` is rarely used in the transportation context. Even so, I think it's more important to capture this word in case transportation tweets are found. We will handle any saved tweets unrelated to transportation after the initial data collection phase (during preprocessing).

From other files, it also appears that some of these tweets may be duplicates from the same individual. I noticed this especially on 1/24/2020. It's probably a good idea to do some filtering in case those are spam accounts or bots which could bias our analysis. 

## COVID-19+Transportation Data Analysis <a href name id="4">

Here we will generate the functions to analyze a simple NLTK tokenizer. This will give us a better idea of what words are used the most and how we can proceed during the filtering process.

## Use NLTK to calculate frequency distributions <a href name id="2.1">

We will use NLTK to process this single document of tweets from `collect-07232020-08302020/covid-mobility-tweet-starting-2020-7-23~02:00:46.json`, then we will do the same analysis on the folder of `.json` files: `collect-07232020-08302020`.

In [16]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /students/jw10/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [17]:
def createDocument(filePath):
    '''
    Creates a document out of a single hour of tweet IDs
    '''
    with open(filePath, 'r') as file:
        document = ''
        jsonList = list(file)
        
        for item in jsonList[1:]:
            jsonItem = json.loads(item)
            document += jsonItem['text']
        
    return document
        

In [18]:
document = createDocument(filePaths[0])
tokens = nltk.Text(document.split()) # tokenize with a general tokenizer class
print(tokens[:50])

print("Length of tokens",len(tokens))

FileNotFoundError: [Errno 2] No such file or directory: 'covid-mobility-tweet-starting-2021-1-18~02:01:22.jsonl'

#### Tweet Tokenizer

In [None]:
import string
from nltk.tokenize import TweetTokenizer
from collections import Counter

stopwordsList = nltk.corpus.stopwords.words('english')
punctuation = string.punctuation

def cleanTweets(someTweets):
    """Given a string that it's a tweet or many tweets joined together,
    clean it up to use for further analysis.
    """
    # Your code here
    # 1) lowercase tweet words
    loweredTweets = someTweets.lower()
    
    # 2) tokenize tweets
    tweet_tokenizer = TweetTokenizer()
    tokens = tweet_tokenizer.tokenize(loweredTweets)
    
    # 3) Remove stopwords
    noStopWords = [w for w in tokens if w not in stopwordsList]
    # 4) Remove punctuation
    noPunctuation = [w for w in noStopWords if w not in punctuation]
    
    #5) Remove http symbol
    noHTTP = [w for w in noPunctuation if w not in 'http']

    #5.1) Remove ellipses (this should be unnecessary after "extending" the tweets)
    cleanTweets = [w for w in noHTTP if w not in 'rt … “ ’']
    
    
    return cleanTweets

In [None]:
cleanedDoc = cleanTweets(document)

In [None]:
Counter(cleanedDoc).most_common(10)

#### Now we do this on a larger document

We run through all hydrated tweet files generated to obtain a large bag of words document. 

In [None]:
def createLargeDocument(directory):
    document = ''
    for filePath in directory:
        if "bb" in filePath:
            continue
#         print(filePath)
        with open(filePath, 'r') as file:
            jsonList = list(file)

            for item in jsonList[1:]:
                jsonItem = json.loads(item)
                document += jsonItem['text']
                
    return document

In [None]:
largeDocument = createLargeDocument(filePaths)

print(len(largeDocument))

In [None]:
cleanedLargeDoc = cleanTweets(largeDocument)

In [None]:
Counter(cleanedLargeDoc).most_common(10)

#### Observations 
The most common words provide us with little understanding of what is going on in the tweets. In fact, the most common words are focused on only COVID-19 (and we know this is a COVID-19 dataset)! Thus, we must conduct further preprocessing to narrow down the **transportation** words used, and generate more comprehensive topics related to both transportation and COVID-19. However, this does bring up an important problem I didn't consider before. Since the COVID-19 dataset only captures tweets which contain COVID-19 words, we might be missing a lot of data about transportation during this time period that relates to coronavirus events, but does not use the specific language. Even so, my hope is that more data and better NLP modeling techniques can help generate some meaning in this jumble!

## Simple Topic Modeling and Visualization <a href name id="2.2">
Next, we use the pyLDAvis library to generate an LDA model word visualization using TD-IDF. I will plan to separate my tweets into documents based on my transportation keywords so that I am looking for topics among several documents built around a certain keyword rather than the bag-of-words approach. I will look into this deeper in PM3. 

In [32]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

import pyLDAvis
import pyLDAvis.sklearn

pyLDAvis.enable_notebook()

In [33]:
# Create the TF vector represetnation, this only counts the terms in each document

tf_vectorizer = CountVectorizer(strip_accents = 'unicode',
                                stop_words = 'english',
                                lowercase = True,
                                token_pattern = r'\b[a-zA-Z]{3,}\b',
                                max_df = 0.5, 
                                min_df = 10)

dtm_tf = tf_vectorizer.fit_transform(cleanedLargeDoc)

tfidf_vectorizer = TfidfVectorizer(**tf_vectorizer.get_params())

dtm_tfidf = tfidf_vectorizer.fit_transform(cleanedLargeDoc)
print(dtm_tfidf.shape)



(43630, 593)


In [35]:
# for TF DTM
lda_tf = LatentDirichletAllocation(n_components=20, random_state=0)
lda_tf.fit(dtm_tf)

lda_tf

LatentDirichletAllocation(n_components=20, random_state=0)

In [36]:
pyLDAvis.sklearn.prepare(lda_tfidf, dtm_tfidf, tfidf_vectorizer)

  default_term_info = default_term_info.sort_values(
  from imp import reload
  from imp import reload
  from imp import reload
  from imp import reload


The MDS visualization doesn't really provide us with much insight, so we use MMDS to look at the data in more dimensions.

In [37]:
pyLDAvis.sklearn.prepare(lda_tfidf, dtm_tfidf, tfidf_vectorizer, mds='mmds')

  default_term_info = default_term_info.sort_values(
  from imp import reload
  from imp import reload
  from imp import reload
  from imp import reload
