# Part 3 - Data Cleaning

## Overview
This notebook uses the functions we created in the "Data Exploration" notebook through the module `DataRetrieval`. Our goal is to combine all the hourly tweet files into larger documents. We will save the collected tweets in files (1) by day, (2) by season (i.e., spring, summer, fall, winter), and (3) for the whole time range.

After the data cleaning step, we move on to preprocessing, which is in the notebook `PM3 - Data Preprocessing`.

### Table of Contents
1. [Motivation](#1)
    1. [Save collection tweets into larger files to prepare for large data analysis](#1a)
    2. [Open the .jsonl file containing all transportation tweets ](#1b)
    3. [Observe some of the text data](#1c)
    4. [Dehydrate and rehydrate tweets to obtain full_text field](#1d)
    5. [Observe tweet text after rehydrating](#1e)
2. [Creating a tweet classifier](#2)
    1. [Classifier to remove Words With Multiple Meanings](#2a)
    2. [Observe some irrelevant tweets to see if we're capturing accurate information](#2b)
    3. [Create a classifier which removes non-English tweets](#2c)
3. [Store relevant tweet IDs in .txt file ](#3)
    1. [Rehydrate our relevant tweets](#3a)

## Motivation <a href name id="1">
Now that I have collected a sizeable amount of data, I want to remove the tweets that are not relevant to transportation, an issue I noted in PM2:

    Another issue is that our transportation keywords captured more than just mobility/transportation tweets. The word "line" is found quite a bit, but rarely relates to transportation. For example, Guess this is the new party line cause it sure ain't this thing called "truth." https://t.co/rkB5Gwzal4 uses the word "line" in relation to political party lines. My exploration into other files also shows that line is rarely used in the transportation context. Even so, I think it's more important to capture this word in case transportation tweets are found. We will handle any saved tweets unrelated to transportation after the initial data collection phase (during preprocessing).

    From other files, it also appears that some of these tweets may be duplicates from the same individual. I noticed this especially on 1/24/2020. It's probably a good idea to do some filtering in case those are spam accounts or bots which could bias our analysis. 
    
In this notebook, I will tackle this issue.

In [1]:
import os
import json
import random
import sys
sys.path.append(".")
import helper.DataRetrieval as dr

In [2]:
# # Example of using DataRetrieval module
# # Gets a list of all text files in COVID-19 dataset
# time_range = ['2020-01', '2020-02', '2020-03', '2020-04', '2020-05', '2020-06', '2020-07', '2020-08', '2020-09', '2020-10', '2020-11', '2020-12', 
#             '2021-01', '2021-02', '2021-03', '2021-04', '2021-05', '2021-06', '2021-07', '2021-08', '2021-09', '2021-10', '2021-11', '2021-12', 
#             '2022-01', '2022-02', '2022-03', '2022-04']
# covidDataset = dr.COVIDDataRetriever().fetchAllTextFileURLs(time_range)

# covidDataset[:10]

### Save collection tweets into larger files to prepare for large data analysis. <a href name id="1a">

In [3]:
# Saves all tweets collected on the CS server into daily files

basePath = '/students/jw10/cs315/collection-tweets/'
writeToPath = '/students/jw10/cs315/tweets-by-day'
dr.FileSynthesizer().combineHourlyTweetsToDaily(basePath, writeToPath, printOut=False)

In [4]:
### NOTE: ALL LOCAL FILES HAVE BEEN ADDED TO THE `COLLECTION-LOGS` DIRECTORY. NO LONGER NEED TO RUN THIS
# # Saves all tweets in local collection files into daily files
# basePath = '/students/jw10/cs315/local-collection-tweets/'
# writeToPath = '/students/jw10/cs315/tweets-by-day'
# dr.FileSynthesizer().combineHourlyTweetsToDaily(basePath, writeToPath)

In [5]:
# Save all tweets into one large file (uses the daily files generated in the cells above)
basePath='/students/jw10/cs315/tweets-by-day'
writeToPath = '/students/jw10/cs315/all-tweets'

dr.FileSynthesizer().getAllTweets(basePath, writeToPath)

### Open the .jsonl file containing all transportation tweets <a href name id="1b">

In [6]:
allTweets = []
with open('/students/jw10/cs315/all-tweets/covid-mobility-tweet-all.jsonl') as f:
    for line in f:
        allTweets.append(json.loads(line))

len(allTweets)

16936

### Observe some of the text data <a href name id="1c">

In [7]:
def returnLoweredTweetText(tweet):
    if "retweeted_status" not in tweet: 
        if 'full_text' not in tweet:
            return tweet['text'].lower()
        else:
            return tweet['full_text'].lower()
    else:
        if 'full_text' not in tweet['retweeted_status']:
            return tweet['retweeted_status']['text'].lower()
        else:
            return tweet['retweeted_status']['full_text'].lower()

def printTweetText(tweet):
    print('*** Tweet ***')
    if "retweeted_status" not in tweet: 
        if 'full_text' not in tweet:
            print(tweet['text'])
        else:
            print(tweet['full_text'])
    else:
        if 'full_text' not in tweet['retweeted_status']:
            print(tweet['retweeted_status']['text'])
        else:
            print(tweet['retweeted_status']['full_text'])
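
Both helpers above repeat the same lookup order: use the original tweet for retweets, and prefer `full_text` over `text`. A minimal sketch of one way to factor that out is shown here. The name `getTweetText` is hypothetical and not part of the original notebook, and unlike the helpers above it returns an empty string (rather than raising `KeyError`) when neither field is present.

```python
def getTweetText(tweet):
    """Return the fullest text available for a tweet dict (v1.1-style JSON)."""
    # For retweets, read from the original tweet, as the helpers above do.
    source = tweet.get('retweeted_status', tweet)
    # Prefer the untruncated 'full_text' field; fall back to 'text'.
    return source.get('full_text', source.get('text', ''))


# Toy tweets for illustration only (not taken from the dataset).
plain = {'text': 'Short text'}
extended = {'text': 'Trunc', 'full_text': 'The full, untruncated text'}
retweet = {'text': 'RT @a: Trunc', 'retweeted_status': {'full_text': 'Original full text'}}

for example in (plain, extended, retweet):
    print(getTweetText(example))
```

With such a helper, lowering and printing become one-liners (`getTweetText(tweet).lower()` and `print(getTweetText(tweet))`).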

In [8]:
random.seed(2) # I'm doing this so that the sequence of random numbers is replicable

for i in range(3):
    tweet = random.choice(allTweets)
    printTweetText(tweet)
    print(80*'-')   

*** Tweet ***
This is what #Thanksgiving traffic looked like in Los Angeles before COVID-19 https://t.co/IS91onrhRU
--------------------------------------------------------------------------------
*** Tweet ***
@TimSmithMP 690+ of which were in Federal (LNP) Government run aged care facilities. ZERO covid deaths in state run aged car facilities in Vic.
--------------------------------------------------------------------------------
*** Tweet ***
Closing the station because everyone has COVID and then blaming it on “defunding” that 100% has not happened is the most cop shit ever. They know their supporters will run will a false narrative, even if they know it’s clearly false. Incredible. https://t.co/33hUqAH0Cj
--------------------------------------------------------------------------------


We notice that many of the tweets are cut off. I observed this in the last milestone, when I found out that you must pass the parameter `tweet_mode='extended'` to the tweepy `get_status` request to receive the full text. We will dehydrate our transportation files into their tweet IDs and then rehydrate them to see if we get the full text.

### Dehydrate and rehydrate tweets to obtain `full_text` field. <a href name id="1d">

In [9]:
# Dehydrate full list of tweets .jsonl file
os.system('twarc dehydrate /students/jw10/cs315/all-tweets/covid-mobility-tweet-all.jsonl > /students/jw10/cs315/all-tweets/covid-mobility-tweet-all.txt')

0

In [10]:
# Rehydrate the full list of tweets from the .txt file to a .jsonl file to get the full text.

os.system('twarc hydrate /students/jw10/cs315/all-tweets/covid-mobility-tweet-all.txt > /students/jw10/cs315/all-tweets/covid-mobility-tweet-all-rehydrated.jsonl' )

0
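
The `0` printed under the two cells above is the exit status returned by `os.system`; a non-zero value would be easy to overlook. As a hedged alternative, the sketch below uses `subprocess.run` with `check=True` so that a failed command raises an exception. The helper name `runToFile` is hypothetical, and the demonstration runs a trivial Python command rather than `twarc` so that it works without credentials.

```python
import os
import subprocess
import sys
import tempfile


def runToFile(command, outPath):
    """Run command (a list of arguments) and write its stdout to outPath."""
    with open(outPath, 'w') as out:
        # check=True raises CalledProcessError on a non-zero exit status
        subprocess.run(command, stdout=out, check=True)


# Demonstration with a harmless command. In this notebook the call might look like
# runToFile(['twarc', 'hydrate', idsPath], rehydratedPath), where both paths are placeholders.
demoPath = os.path.join(tempfile.mkdtemp(), 'demo.txt')
runToFile([sys.executable, '-c', 'print("ok")'], demoPath)

with open(demoPath) as f:
    print(f.read().strip())
```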

In [11]:
allTweets = []
with open('/students/jw10/cs315/all-tweets/covid-mobility-tweet-all-rehydrated.jsonl') as f:
    for line in f:
        allTweets.append(json.loads(line))

len(allTweets)

16789

**Observations**: It appears that we have lost 147 tweets (16,936 - 16,789) that were in our initial `covid-mobility-tweet-all.jsonl` file, which is under 1% of that file. It is possible that these were repeat tweets which `twarc` removed on its own or that they were incorrectly formatted; tweets that have been deleted or made private since collection also cannot be rehydrated. For now, we will ignore this because it does not make a large impact on the size of our dataset.
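
One way to check these explanations is to compare tweet IDs before and after rehydration. The sketch below assumes each tweet dict carries an `id` field (as the later cells in this notebook do); `findMissingIDs` is a hypothetical helper, and the toy lists stand in for the two loaded files.

```python
def findMissingIDs(originalTweets, rehydratedTweets):
    """Return (IDs absent after rehydration, number of duplicate entries in the original)."""
    originalIDs = [t['id'] for t in originalTweets]
    uniqueOriginal = set(originalIDs)
    rehydratedIDs = {t['id'] for t in rehydratedTweets}
    # Entries that repeat an ID already seen in the original file
    duplicateCount = len(originalIDs) - len(uniqueOriginal)
    # IDs that were collected but did not come back from hydration
    missing = uniqueOriginal - rehydratedIDs
    return missing, duplicateCount


# Toy data for illustration only
before = [{'id': 1}, {'id': 2}, {'id': 2}, {'id': 3}]
after = [{'id': 1}, {'id': 2}]

missing, dupes = findMissingIDs(before, after)
print(len(missing), 'IDs not returned;', dupes, 'duplicate entries')
```

If the duplicate count accounts for most of the gap, the repeat-tweet explanation is the likelier one; otherwise the missing IDs point to tweets that are no longer available.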

### Observe tweet text after rehydrating <a href name id="1e">

In [12]:
random.seed(2) # I'm doing this so that the sequence of random numbers is replicable

for i in range(5):
    tweet = random.choice(allTweets)
    printTweetText(tweet)
    print(80*'-')   

*** Tweet ***
COVID Update July 22: The “anti” culture in America is a long run hazard to public health &amp; a near term hazard to yours.

We need to address it &amp; turn it around.  1/
--------------------------------------------------------------------------------
*** Tweet ***
Yes @JustinTrudeau , take the #AstraZeneca #vaccine ! If you take it, I’ll take it also. Now it’s the time to walk the talk! #cdnpoli #GetVaccinated #Covid19 #Canada #Ottawa #ableg #abpoli #onpoli #bcpoli #Yeg #yyc #COVID19Ontario #LPC @gmbutts #VaccineForAll #Toronto #Ontario https://t.co/kKcTQ60g3m
--------------------------------------------------------------------------------
*** Tweet ***
#COVID19 | In Cork on Sunday, gardaí broke up a score of road bowling on roads in the Bottlehill area of Burnfort, just south of Mallow, with up to 50 people in attendance reports @EoinBearla
https://t.co/3oQtthZlP4
--------------------------------------------------------------------------------
*** Tweet ***
New Covid

## Creating a tweet classifier <a href name id="2">

Some of these tweets are unrelated to transportation. We build a simple classifier to test whether or not these tweets are relevant.

In [12]:
import pandas as pd
import inflect

p = inflect.engine()

def saveKeywords():
    keywords = pd.read_csv('./cs315_keywords.csv')
    # lowercase all keyword strings
    keywords['public transport'] = keywords['public transport'].str.lower()
    keywords['motorized'] = keywords['motorized'].str.lower()
    keywords['non-motorized'] = keywords['non-motorized'].str.lower()

    transit = [word for word in keywords['public transport'] if not pd.isna(word)]
    motorized = [word for word in keywords['motorized'] if not pd.isna(word)]
    nonmotorized = [word for word in keywords['non-motorized'] if not pd.isna(word)]

    pluralKeywords = [p.plural(word) for word in (transit + motorized + nonmotorized)] # not exhaustive, but captures some meaning

    allKeywords = set(transit + motorized + nonmotorized + pluralKeywords)
    return allKeywords

First we use the `inflect` library to pluralize all our keywords. Some words were already in plural form from when I manually coded them, but I wish I had known about this library earlier.

### Classifier to remove Words With Multiple Meanings <a href name id="2a">

In [13]:
def tweetClassifier1(tweet, keywords):
    '''
    This classifier flags tweets that might be unrelated to transportation. It returns 1 for tweets
    considered relevant and -1 for tweets considered irrelevant, using the following rule:
    
    If the tweet contains "line", "run", "running", "ford", or "lincoln", it checks whether another transportation keyword is also in the tweet
    
    Parameters:
    tweet - the tweet containing a text body to be parsed
    keywords - all transportation keywords
    '''
    uncertainKeywords = {'line', 'run', 'running', 'ford', 'lincoln'}
    tweetText = returnLoweredTweetText(tweet)
    tweetTextSet = set(tweetText.split())

    if uncertainKeywords.intersection(tweetTextSet):
        remainingKeywords = keywords.difference(uncertainKeywords)
        if remainingKeywords.intersection(tweetTextSet):
            return 1
        else:
            return -1
    
    return 1
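
Note that `tweetText.split()` splits only on whitespace, so a token such as `line,` or `#train` does not match the keywords `line` or `train`. The sketch below shows one possible punctuation-aware tokenizer; `tokenize` is a hypothetical helper, not part of the original classifier, and multi-word keywords would still need separate handling.

```python
import re


def tokenize(text):
    """Lowercase text and return its set of word tokens, ignoring punctuation."""
    # Runs of letters, digits, and apostrophes are tokens; everything else separates them
    return set(re.findall(r"[a-z0-9']+", text.lower()))


# Toy example for illustration only
sample = 'Stuck on the Red Line, again. #train delays'

print('line' in set(sample.lower().split()))  # whitespace split keeps the comma attached
print('line' in tokenize(sample))
print('train' in tokenize(sample))
```

Swapping this in for `set(tweetText.split())` would likely change the relevant/irrelevant counts, so the results would need to be re-inspected.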

In [14]:
# Run tweet classifier 1
transportationKeywords = saveKeywords()
irrelevantTweets = []
relevantTweets = []

for tweet in allTweets: 
    classification = tweetClassifier1(tweet, transportationKeywords)
    if classification == -1:
        irrelevantTweets.append(tweet)
#         printTweetText(tweet)
#         print(80*'-')  
    else:
        relevantTweets.append(tweet)
        
print("Irrelevant:",len(irrelevantTweets), "Relevant:",len(relevantTweets))

Irrelevant: 6441 Relevant: 10348


### Observe some irrelevant tweets to see if we're capturing accurate information <a href name id="2b">

In [15]:
import random

random.seed(5) # I'm doing this so that the sequence of random numbers is replicable

for i in range(10):
    tweet = random.choice(irrelevantTweets)
    printTweetText(tweet)
    print(80*'-')   

*** Tweet ***
VP Leni's focus remains on the Covid-19 crisis. She has not decided on 2022, and is not "preparing" to run for Governor. She remains open to all options, including a candidacy for President, and at the appropriate time, she will personally convey her decision on this matter.
--------------------------------------------------------------------------------
*** Tweet ***
Centre withdraws insurance cover for healthcare workers who succumbed in the line of #Covid duty
● Collapsed healthcare system &amp; now deceit with Safai karamcharis, ward-boys, nurses, ASHA workers, paramedics, technicians, doctors! #ModiResign 

https://t.co/2NoiWqixKi
--------------------------------------------------------------------------------
*** Tweet ***
Cuomo is cutting Medicaid mid-pandemic. He should never get to run for office as a Democrat again. https://t.co/ABc3z7dulR
--------------------------------------------------------------------------------
*** Tweet ***
Medical supplies continue to 

As you can see, many of the tweets marked as `irrelevant` do **not** contain information related to transportation. Our classifier found that about 38% of our collected data (6,441 of 16,789 tweets) is irrelevant! We create a function that captures only relevant tweet IDs and store them in a `.txt` file. 

### Create a classifier which removes non-English tweets <a href name id="2c">
    
We will utilize both Twitter's 'lang' field and the library `langid` to determine whether tweets are English or not. We remove non-English tweets from our corpus.
    
Update: we realized that all the tweets contain the `lang` field, so they can be classified without a heavy library. We therefore stick to our current classifier, which relies only on Twitter's own language codes.

In [16]:
# language codes from: https://developer.twitter.com/en/docs/twitter-api/v1/developer-utilities/supported-languages/api-reference/get-help-languages
codes = [
  {
    "code": "fr",
    "status": "production",
    "name": "French"
  },
  {
    "code": "en",
    "status": "production",
    "name": "English"
  },
  {
    "code": "ar",
    "status": "production",
    "name": "Arabic"
  },
  {
    "code": "ja",
    "status": "production",
    "name": "Japanese"
  },
  {
    "code": "es",
    "status": "production",
    "name": "Spanish"
  },
  {
    "code": "de",
    "status": "production",
    "name": "German"
  },
  {
    "code": "it",
    "status": "production",
    "name": "Italian"
  },
  {
    "code": "id",
    "status": "production",
    "name": "Indonesian"
  },
  {
    "code": "pt",
    "status": "production",
    "name": "Portuguese"
  },
  {
    "code": "ko",
    "status": "production",
    "name": "Korean"
  },
  {
    "code": "tr",
    "status": "production",
    "name": "Turkish"
  },
  {
    "code": "ru",
    "status": "production",
    "name": "Russian"
  },
  {
    "code": "nl",
    "status": "production",
    "name": "Dutch"
  },
  {
    "code": "fil",
    "status": "production",
    "name": "Filipino"
  },
  {
    "code": "msa",
    "status": "production",
    "name": "Malay"
  },
  {
    "code": "zh-tw",
    "status": "production",
    "name": "Traditional Chinese"
  },
  {
    "code": "zh-cn",
    "status": "production",
    "name": "Simplified Chinese"
  },
  {
    "code": "hi",
    "status": "production",
    "name": "Hindi"
  },
  {
    "code": "no",
    "status": "production",
    "name": "Norwegian"
  },
  {
    "code": "sv",
    "status": "production",
    "name": "Swedish"
  },
  {
    "code": "fi",
    "status": "production",
    "name": "Finnish"
  },
  {
    "code": "da",
    "status": "production",
    "name": "Danish"
  },
  {
    "code": "pl",
    "status": "production",
    "name": "Polish"
  },
  {
    "code": "hu",
    "status": "production",
    "name": "Hungarian"
  },
  {
    "code": "fa",
    "status": "production",
    "name": "Farsi"
  },
  {
    "code": "he",
    "status": "production",
    "name": "Hebrew"
  },
  {
    "code": "ur",
    "status": "production",
    "name": "Urdu"
  },
  {
    "code": "th",
    "status": "production",
    "name": "Thai"
  },
  {
    "code": "en-gb",
    "status": "production",
    "name": "English UK"
  }
]

languages = set([item['code'] for item in codes])

print(languages)

{'ja', 'fil', 'en-gb', 'de', 'th', 'ru', 'pl', 'es', 'msa', 'he', 'nl', 'hi', 'sv', 'da', 'fa', 'en', 'zh-tw', 'zh-cn', 'fr', 'fi', 'pt', 'ur', 'ar', 'ko', 'no', 'it', 'hu', 'tr', 'id'}


In [17]:
# Tests langid
import langid

s = "zhe shi zhong wen"
s2 = "I can't believe they would do something like that"
s3 = "I’ll never understand how folks can hear ‘Pa is running out of ICU beds in a pandemic’ and two seconds later flip out because they’re closing bars for one — one — night.  At a complete loss of how these folks process the world."
lang, confidence = langid.classify(s3)
print(lang)

en


In [18]:
def tweetClassifier2(tweet):
    '''
    Classifies English tweets as 1 and non-English tweets as -1
    '''

    if 'lang' in tweet:
        if tweet['lang'] not in {'en-gb', 'en'}:
            return -1
        else:
            if tweet['lang'] != "en":
                print("this is in the field: ",tweet['lang'])
            return 1
    else:
        print('no language field')
        tweetText = returnLoweredTweetText(tweet)
        lang, confidence = langid.classify(tweetText)
        if lang == 'en':
            return 1
        else:
            return -1
        
def tweetClassifier_withlibrary(tweet):
    '''
    Classifies English tweets as 1 and non-English tweets as -1
    '''
    
    tweetText = returnLoweredTweetText(tweet)
    lang, confidence = langid.classify(tweetText)
    if lang == 'en':
        return 1
    else:
        return -1

## Store relevant tweet IDs in `.txt` file <a href name id="3">

In [19]:
def filterRelevantTweetIDs(allTweets, keywords, languages):
    '''
    Uses both classifiers to remove irrelevant tweets. Returns a list of relevant tweet IDs.
    '''
    relevantTweets = []
    for tweet in allTweets: 
        # use the keywords argument rather than the global transportationKeywords
        classification1 = tweetClassifier1(tweet, keywords)
        classification2 = tweetClassifier2(tweet)
        if classification1 + classification2 == 2:
            relevantTweets.append(tweet['id'])
    
    return relevantTweets
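
The check `classification1 + classification2 == 2` works for two classifiers but has to be edited whenever one is added. A minimal sketch of a more general form follows, assuming every classifier takes a tweet and returns 1 or -1; `filterTweetIDs`, `isEnglish`, and `mentionsBus` are hypothetical names used only for illustration.

```python
def filterTweetIDs(tweets, classifiers):
    """Return the IDs of tweets for which every classifier returns 1."""
    return [t['id'] for t in tweets if all(c(t) == 1 for c in classifiers)]


# Stand-in classifiers for illustration; they are not the notebook's classifiers.
def isEnglish(tweet):
    return 1 if tweet.get('lang') == 'en' else -1


def mentionsBus(tweet):
    return 1 if 'bus' in tweet.get('text', '').lower().split() else -1


# Toy tweets for illustration only
toy = [
    {'id': 10, 'lang': 'en', 'text': 'The bus is late'},
    {'id': 11, 'lang': 'fr', 'text': 'Le bus est en retard'},
    {'id': 12, 'lang': 'en', 'text': 'Party line politics'},
]

print(filterTweetIDs(toy, [isEnglish, mentionsBus]))
```

A classifier that needs extra arguments, such as `tweetClassifier1`, could be wrapped first, for example with `lambda t: tweetClassifier1(t, keywords)`.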

In [20]:
transportationKeywords = saveKeywords()
relevantTweets = filterRelevantTweetIDs(allTweets, transportationKeywords, languages)
print(len(relevantTweets))
with open('/students/jw10/cs315/all-tweets/covid-mobility-tweet-all-relevant.txt', 'w') as f:
    for tweetID in relevantTweets:
        f.write(f'{tweetID} \n')

9125


### Rehydrate our relevant tweets <a href name id="3a">

In [21]:
os.system('twarc hydrate /students/jw10/cs315/all-tweets/covid-mobility-tweet-all-relevant.txt > /students/jw10/cs315/all-tweets/covid-mobility-tweet-all-relevant.jsonl' )

0

### Now we're ready to do some analysis!