# Part 4 - Data Preprocessing

## Overview
This notebook takes the dataset assembled in the Data Cleaning stage and preprocesses it using techniques discussed in class. I work primarily on the raw dataset and write reusable functions for later analysis. Many of the later notebooks repeat these functions to preprocess the data (although I later moved them into a module/class). The notebook also includes some tests of topic modeling visualizations that I ultimately discarded from my results.


## Table of Contents
1. [Open our relevant tweets from local `.jsonl` file](#1)
2. [Descriptive Statistics](#2)
    1. [By Users](#2a)
    2. [By Mode](#2b)
    3. [Creation of `DataTransformation` Module](#2c)
3. [Preprocessing the tweets](#3)
4. [Tests of Topic Modeling Visualizations](#4)
5. [Conclusions + Post-Data Exploration Thoughts](#5)

## Open our relevant tweets from local `.jsonl` file <a href name id="1">
    
Now that we have a file with our full set of tweets, we can access them directly.

In [6]:
import sys

sys.path.append(".")

import helper.DataRetrieval as dr

In [7]:
import json

relevantTweets = []
with open('/students/jw10/cs315/all-tweets/covid-mobility-tweet-all-relevant.jsonl') as f:
    for line in f:
        relevantTweets.append(json.loads(line))

print("We have collected:", len(relevantTweets), "relevant tweets and it is stored in", type(relevantTweets), "format.")

We have collected: 9123 relevant tweets and it is stored in <class 'list'> format.


## Descriptive Statistics <a href name id="2">

To obtain descriptive statistics, we aggregate our tweets based on (1) users and (2) transportation mode.

### By Users <a href name id="2a">

In [8]:
# relevantTweets[0]

In [9]:
def getUniqueUsers(tweets):
    '''
    Returns a dictionary mapping each user ID to the list of IDs of that user's tweets
    '''
    userDict = {}

    for tweet in tweets: # iterate over the argument rather than the global relevantTweets
        tweetID = tweet['id']
        user = tweet['user']['id_str'] # user ID
        if user not in userDict:
            userDict[user] = [tweetID]
        else:
            userDict[user].append(tweetID)

    return userDict

In [10]:
uniqueUsers = getUniqueUsers(relevantTweets)

In [11]:
print(len(uniqueUsers.keys()))

8931


With 8,931 unique users across 9,123 tweets, most of the Transportation+COVID-19 tweets we captured came from users who appear only once in the dataset. Aggregating by user therefore tells us little about top user engagement, so we focus on the modes of transportation instead.

In [12]:
maxTweetCount, maxTweeter = 0, 'none'
for user in uniqueUsers:
    tweetCount = len(uniqueUsers[user])
    if tweetCount > maxTweetCount:
        maxTweetCount, maxTweeter = tweetCount, user
        
print("Highest tweet count for single user:", maxTweetCount, "\nID of user with highest tweet count:", maxTweeter)

Highest tweet count for single user: 22 
ID of user with highest tweet count: 252563614
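
The loop above reports only the single most active user. As a hedged sketch, `collections.Counter` can summarize the whole distribution of tweets per user, assuming each tweet carries `tweet['user']['id_str']` as in the cells above. The function names and the toy data are illustrative and are not part of the original notebook.

```python
from collections import Counter

def countTweetsPerUser(tweets):
    """Count tweets per user ID (assumes tweet['user']['id_str'] exists)."""
    return Counter(tweet['user']['id_str'] for tweet in tweets)

def activityHistogram(counts):
    """Map 'number of tweets' to 'number of users with that many tweets'."""
    return Counter(counts.values())

# Toy data standing in for relevantTweets
sampleTweets = [
    {'id': 1, 'user': {'id_str': 'a'}},
    {'id': 2, 'user': {'id_str': 'b'}},
    {'id': 3, 'user': {'id_str': 'a'}},
    {'id': 4, 'user': {'id_str': 'c'}},
]

counts = countTweetsPerUser(sampleTweets)
print(counts.most_common(1))      # most active user and their tweet count
print(activityHistogram(counts))  # how many users tweeted once, twice, ...
```

Run on the real data, the histogram would show directly how many users contributed exactly one tweet.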


### By transportation mode <a href name id="2b">


In [17]:
import pandas as pd
import inflect

def getKeywordsByCategory():
    '''
    Reads keywords from the .csv file and adds their pluralized forms. Returns a tuple of three keyword sets: (transit, motorized, nonmotorized)
    '''
    p = inflect.engine()
    keywords = pd.read_csv('./cs315_keywords.csv')
    # lowercase all keyword strings
    keywords['public transport'] = keywords['public transport'].str.lower()
    keywords['motorized'] = keywords['motorized'].str.lower()
    keywords['non-motorized'] = keywords['non-motorized'].str.lower()

    transit = [word for word in keywords['public transport'] if not pd.isna(word)]
    motorized = [word for word in keywords['motorized'] if not pd.isna(word)]
    nonmotorized = [word for word in keywords['non-motorized'] if not pd.isna(word)]
    
    transitPlural = [p.plural(word) for word in transit] # not exhaustive, but captures some meaning
    motorizedPlural = [p.plural(word) for word in motorized] # not exhaustive, but captures some meaning
    nonmotorizedPlural = [p.plural(word) for word in nonmotorized] # not exhaustive, but captures some meaning
    
    return (set(transit+transitPlural), set(motorized+motorizedPlural), set(nonmotorized+nonmotorizedPlural))


In [34]:
transit, motorized, nonmotorized = getKeywordsByCategory()

print(transit)
print(motorized)
print(nonmotorized)
# get the length of the keywords + plurals set
print(len(transit.union(motorized).union(nonmotorized)))

{'mass transits', 'railroad', 'shuttles', 'lines', 'amtrak', 'transportations', 'green line', 'heavy rail', 'caltrains', 'line', 'mtas', 'metro', 'caltrain', 'silver line', 'metrorail', 'bus', 'monorail', 'streetcar', 'bart', 'cable car', 'transports', 'blue lines', 'rapid transits', 'light rail', 'monorails', 'mta', 'metropolitans', 'railways', 'muni', 'blue line', 'metropolitan', 'red lines', 'rail', 'mbta', 'mbtas', 'barts', 'light rails', 'railway', 'heavy rails', 'railroads', 'streetcars', 'buse', 'commuter rails', 'subways', 'metrorails', 'rapid transit', 'silver lines', 'cable cars', 'rails', 'amtraks', 'mass transportations', 'metros', 'transit', 'munis', 'subway', 'green lines', 'buses', 'station', 'mass transportation', 'shuttle', 'stations', 'trolley', 'transport', 'mass transit', 'transportation', 'trolleys', 'commuter rail', 'transits', 'red line'}
{'honda', 'chevrolets', 'scooter', 'jeep', 'highways', 'toyotas', 'uber', 'lincolns', 'infiniti', 'routes', 'subaru', 'engine'

In [35]:
# copied from Data Cleaning notebook
def returnLoweredTweetText(tweet):
    '''
    Returns the non-truncated tweet text in lower case
    '''
    if "retweeted_status" not in tweet: 
        if 'full_text' not in tweet:
            return tweet['text'].lower()
        else:
            return tweet['full_text'].lower()
    else:
        if 'full_text' not in tweet['retweeted_status']:
            return tweet['retweeted_status']['text'].lower()
        else:
            return tweet['retweeted_status']['full_text'].lower()
        
def printTweetText(tweet):
    '''
    Prints out tweet text
    '''
    print('*** Tweet ***')
    if "retweeted_status" not in tweet: 
        if 'full_text' not in tweet:
            print(tweet['text'])
        else:
            print(tweet['full_text'])
    else:
        if 'full_text' not in tweet['retweeted_status']:
            print(tweet['retweeted_status']['text'])
        else:
            print(tweet['retweeted_status']['full_text'])
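
Both helpers above walk the same four-way branch. A minimal sketch of how that lookup could be factored into one function, assuming the tweet dictionaries have the shape used above; `getFullText` is a hypothetical name, not a function from the Data Cleaning notebook.

```python
def getFullText(tweet):
    """Return the untruncated text of a tweet, preferring the original of a retweet."""
    # For a retweet, the complete text lives on the embedded original tweet
    source = tweet.get('retweeted_status', tweet)
    # Extended-mode payloads carry 'full_text'; older payloads only have 'text'
    if 'full_text' in source:
        return source['full_text']
    return source['text']

# Hypothetical miniature tweets covering three of the branches
plainTweet = {'text': 'Bus is LATE'}
extendedTweet = {'full_text': 'Taking the Train'}
retweet = {'full_text': 'RT @x: The whole orig',
           'retweeted_status': {'full_text': 'The whole original tweet'}}

print(getFullText(plainTweet).lower())
print(getFullText(retweet))
```

With such a helper, lowering and printing each become a one-line call on its return value.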



In [36]:
def extractMode(tweet, transit, motorized, nonmotorized):
    '''
    Returns the set of transportation categories whose keywords appear in the tweet.
    A tweet that mentions several categories is assigned to each of them.
    '''
    modes = set()
    tweetText = returnLoweredTweetText(tweet)
    tweetTextSet = set(tweetText.split())
    if transit.intersection(tweetTextSet):
        modes.add('transit')
    if motorized.intersection(tweetTextSet):
        modes.add('motorized')
    if nonmotorized.intersection(tweetTextSet):
        modes.add('nonmotorized')

    return modes

def getTweetsByTransportationMode(tweets, categories):
    '''
    Returns a dictionary with transport-mode/tweets as keys/value pairs
    '''
    modeDict = {}
    transit, motorized, nonmotorized = categories

    for tweet in tweets:
#         tweetID = tweet['id']
        modes = extractMode(tweet, transit, motorized, nonmotorized)
        for mode in modes:
            if mode not in modeDict:
                modeDict[mode] = [tweet]
            else:
                modeDict[mode].append(tweet)

    return modeDict
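
Because `extractMode` splits the text on whitespace, a multi-word keyword such as 'light rail' can never equal a single token, and a token with attached punctuation such as 'bus,' does not match 'bus'. The sketch below shows one possible alternative based on regular expressions with word boundaries. The function names and the miniature keyword sets are assumptions for illustration; this is not the matching used to produce the results in this notebook.

```python
import re

def buildKeywordPattern(keywords):
    """Compile one regex that matches any keyword as a whole word or phrase."""
    # Try longer keywords first so 'light rail' is preferred over 'rail'
    ordered = sorted(keywords, key=len, reverse=True)
    alternation = '|'.join(re.escape(k) for k in ordered)
    # Lookarounds ensure the match is not embedded in a longer word
    return re.compile(r'(?<!\w)(?:' + alternation + r')(?!\w)')

def extractModesByPattern(text, patternsByMode):
    """Return the set of mode names whose keywords occur in the text."""
    text = text.lower()
    return {mode for mode, pattern in patternsByMode.items() if pattern.search(text)}

# Hypothetical miniature keyword sets, not the full cs315_keywords.csv lists
patterns = {
    'transit': buildKeywordPattern({'bus', 'light rail', 'subway'}),
    'motorized': buildKeywordPattern({'car', 'uber'}),
    'nonmotorized': buildKeywordPattern({'walk', 'bike'}),
}

print(extractModesByPattern('Took the light rail, then my bike.', patterns))
print(extractModesByPattern('A scary carnival', patterns))  # 'car' inside other words is ignored
```

Switching to this kind of matching would change the per-mode counts, so any downstream numbers would need to be regenerated.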

In [37]:
tweetsByMode = getTweetsByTransportationMode(relevantTweets, getKeywordsByCategory())

print("Transit", len(tweetsByMode['transit']))
print("Motorized", len(tweetsByMode['motorized']))
print("Nonmotorized", len(tweetsByMode['nonmotorized']))

Transit 3000
Motorized 3687
Nonmotorized 2507


In [52]:
# we do this in the Text Analysis notebook
# import matplotlib.pyplot as plt

# fig = plt.figure()
# ax = fig.add_axes([0,0,1,1])
# modes = ['Transit', 'Motorized', 'Non-Motorized']
# counts = [len(tweetsByMode['transit']),len(tweetsByMode['motorized']), len(tweetsByMode['nonmotorized'])]
# ax.bar(modes,counts)

# # customizations

# plt.show()

# plt.savefig('tweetsPerCategory.png')

### Creation of `DataTransformation` Module <a href name id="2c">

I have moved the functions above into a module called `DataTransformation`. Its `TransportationMode` class splits our tweets into their respective categories, and we will use it to prepare our files for LIWC analysis.

In [39]:
transitDF = pd.DataFrame.from_dict(tweetsByMode['transit'])

print(transitDF.keys())

Index(['created_at', 'id', 'id_str', 'full_text', 'truncated',
       'display_text_range', 'entities', 'source', 'in_reply_to_status_id',
       'in_reply_to_status_id_str', 'in_reply_to_user_id',
       'in_reply_to_user_id_str', 'in_reply_to_screen_name', 'user', 'geo',
       'coordinates', 'place', 'contributors', 'retweeted_status',
       'is_quote_status', 'quoted_status_id', 'quoted_status_id_str',
       'quoted_status_permalink', 'retweet_count', 'favorite_count',
       'favorited', 'retweeted', 'lang', 'possibly_sensitive', 'quoted_status',
       'extended_entities', 'withheld_in_countries'],
      dtype='object')


In [40]:
import random

random.seed(5) # I'm doing this so that the sequence of random numbers is replicable

for i in range(10):
    tweet = random.choice(tweetsByMode['motorized'])
    printTweetText(tweet)
    print(80*'-')   

*** Tweet ***
rt @wef @mikequindazzi @antgrasso @fisher85m
Remote working and online shopping could drive 14 million cars off US roads – permanently https://t.co/flGl4HiptR #covid19 #sdi20 https://t.co/u7OM3XRPZa
--------------------------------------------------------------------------------
*** Tweet ***
Info for Pak Exporters/importers:

All ports are operational in Italy.
No route for merchandise/trade cargo is blocked.
No Pakistani export/import consignment is halted in Italy due to Corona Virus issue.
Plz let us know if any of your consignment gets stuck anywhere in Italy. https://t.co/MXzO7IU3Yb
--------------------------------------------------------------------------------
*** Tweet ***
@sergiart5_ @fjruiddiid @XxRaptor_xX @qLxke_ The chances of serious illness are pretty much negligible. And when it comes to death, pretty much nonexistent. You’re more likely to die in a car crash then die from COVID when vaccinated, but people still drive cars
--------------------------------

## Preprocessing the tweets <a href name id='3'>

Tweets are preprocessed in a manner similar to that described in [this paper](https://www.cs.toronto.edu/~jstolee/projects/topic.pdf), which in turn follows conventions similar to those of citations [3] and [4] in that paper.


In [41]:
import urllib.request # the request submodule must be imported explicitly

def getCovidKeywordsList():
    url ='https://raw.githubusercontent.com/echen102/COVID-19-TweetIDs/master/keywords.txt'
    file = urllib.request.urlopen(url)

    covidKeywords = []
    for line in file:
        decoded_line = line.decode("utf-8").split()
        covidKeywords.append(decoded_line[0].lower())

    return covidKeywords

print(getCovidKeywordsList())

['coronavirus', 'koronavirus', 'corona', 'cdc', 'wuhancoronavirus', 'wuhanlockdown', 'ncov', 'wuhan', 'n95', 'kungflu', 'epidemic', 'outbreak', 'sinophobia', 'china', 'covid-19', 'corona', 'covid', 'covid19', 'sars-cov-2', 'covidー19', 'covd', 'pandemic', 'coronapocalypse', 'canceleverything', 'coronials', 'socialdistancingnow', 'social', 'socialdistancing', 'panicbuy', 'panic', 'panicbuying', 'panic', '14dayquarantine', 'duringmy14dayquarantine', 'panic', 'panic', 'panicshop', 'inmyquarantinesurvivalkit', 'panic-buy', 'panic-shop', 'coronakindness', 'quarantinelife', 'chinese', 'chinesevirus', 'stayhomechallenge', 'stay', 'sflockdown', 'dontbeaspreader', 'lockdown', 'lock', 'shelteringinplace', 'sheltering', 'staysafestayhome', 'stay', 'trumppandemic', 'trump', 'flattenthecurve', 'flatten', 'china', 'chinavirus', 'quarentinelife', 'ppeshortage', 'saferathome', 'stayathome', 'stay', 'stay', 'stayhome', 'getmeppe', 'covidiot', 'epitwitter', 'pandemie', 'wear', 'wearamask', 'kung', 'covid

In [48]:
import string
import nltk
from nltk.tokenize import TweetTokenizer
import re

stopwordsList = nltk.corpus.stopwords.words('english')
punctuation = string.punctuation
covidKeywordsList = getCovidKeywordsList()

def removeLinks(tweetString):
    '''Takes a string and removes web links from it'''
    tweetString = re.sub(r'http\S+', '', tweetString) # remove http links
    tweetString = re.sub(r'bit\.ly/\S+', '', tweetString) # remove bitly links
    tweetString = tweetString.replace('[link]', '') # remove literal [link] markers
    return tweetString

def cleanTweets(someTweets):
    """Given a string containing one tweet or many tweets joined together,
    return a cleaned list of tokens for further analysis.
    """
    # 1) lowercase tweet words
    loweredTweets = someTweets.lower()
    
    # 2) tokenize tweets
    tweet_tokenizer = TweetTokenizer()
    tokens = tweet_tokenizer.tokenize(loweredTweets)
    
    # 3) Remove stopwords
    cleanTweets = [w for w in tokens if w not in stopwordsList]

    # 4) Remove punctuation (substring test against string.punctuation, so tokens like '...' survive)
    cleanTweets = [w for w in cleanTweets if w not in punctuation]

    # 5) Remove @mentions
    cleanTweets = [w for w in cleanTweets if w[0] != '@']

    # 6) Remove numbers
    cleanTweets = [w for w in cleanTweets if not w.isnumeric()]

    # 7) Remove COVID keywords, including prefixed forms (w[1:] drops a leading character such as '#')
    cleanTweets = [w for w in cleanTweets if not (w in covidKeywordsList or w[1:] in covidKeywordsList)]
    
    # 8) Remove tokens containing a curly apostrophe (’)
    cleanTweets = [w for w in cleanTweets if '’' not in w]
    
    return cleanTweets

# def oldCleanTweets(someTweets):
#     """Given a string that it's a tweet or many tweets joined together,
#     clean it up to use for further analysis.
#     """
#     # Your code here
#     # 1) lowercase tweet words
#     loweredTweets = someTweets.lower()
    
#     # 2) tokenize tweets
#     tweet_tokenizer = TweetTokenizer()
#     tokens = tweet_tokenizer.tokenize(loweredTweets)
    
#     # 3) Remove stopwords
#     cleanTweets = [w for w in tokens if w not in stopwordsList]
    
#     # 4) Remove punctuation
#     cleanTweets = [w for w in cleanTweets if w not in punctuation]
    
#     # 5) Remove COVID keywords
#     cleanTweets = [w for w in cleanTweets if not (w in covidKeywordsList or w[1:] in covidKeywordsList)]
    
#     cleanTweets = [w for w in cleanTweets if '’' not in w]
    
#     return cleanTweets
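
One subtlety in link removal is worth illustrating: `str.strip` treats its argument as a set of characters to trim from both ends of the string, whereas `str.replace` removes a literal substring wherever it occurs. The sketch below contrasts the two on a made-up sentence; `removeLinksSketch` and the sample text are illustrative only.

```python
import re

def removeLinksSketch(text):
    """Remove http(s) links, bit.ly links, and literal '[link]' markers."""
    text = re.sub(r'http\S+', '', text)
    text = re.sub(r'bit\.ly/\S+', '', text)  # escaped dot matches only a literal '.'
    # replace() deletes the literal substring everywhere in the text
    return text.replace('[link]', '')

sample = 'kin of mine rode the bus [link] today https://t.co/abc bit.ly/xyz'

# Links and the [link] marker are gone; the remaining words are intact
print(removeLinksSketch(sample).split())

# strip() trims the characters '[', 'l', 'i', 'n', 'k', ']' from the ends,
# so it eats the leading word 'kin' and leaves the marker in the middle
print(sample.strip('[link]'))
```

On a long pseudodocument the difference only affects the very beginning and end of the string, but the marker itself is removed only by `replace`.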

We aggregate data into pseudodocuments based on the mode of transport used.

In [49]:
transitWords = ''
motorizedWords = ''
nonmotorizedWords = ''

words = {}
for mode in tweetsByMode:
    for tweet in tweetsByMode[mode]:
        text = returnLoweredTweetText(tweet)
        if mode not in words:
            words[mode] = text
        else:
            words[mode] += ' ' + text
    
    
words.keys()

dict_keys(['nonmotorized', 'motorized', 'transit'])
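
The same aggregation can also be written with `str.join`, which builds each pseudodocument in one pass instead of growing a string inside the loop. This is a minimal sketch under the assumption that the lowered tweet texts have already been grouped by mode; `buildPseudodocuments` and the sample texts are hypothetical.

```python
def buildPseudodocuments(textsByMode):
    """Join each mode's tweet texts into one space-separated string."""
    return {mode: ' '.join(texts) for mode, texts in textsByMode.items()}

# Hypothetical lowered tweet texts, grouped by mode
textsByMode = {
    'transit': ['the bus was empty', 'metro closed again'],
    'motorized': ['traffic is back'],
}

docs = buildPseudodocuments(textsByMode)
print(docs['transit'])  # the bus was empty metro closed again
print(docs.keys())
```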

In [50]:
transitDoc = removeLinks(words['transit'])
nonmotorizedDoc = removeLinks(words['nonmotorized'])
motorizedDoc = removeLinks(words['motorized'])

cleanTransit = cleanTweets(transitDoc)
cleanNonmotorized = cleanTweets(nonmotorizedDoc)
cleanMotorized = cleanTweets(motorizedDoc)

type(cleanTransit)

list

In [51]:
from collections import Counter

print(Counter(cleanTransit).most_common(10))
print(Counter(cleanNonmotorized).most_common(10))
print(Counter(cleanMotorized).most_common(10))

[('transport', 663), ('station', 531), ('public', 509), ('bus', 495), ('people', 436), ('railway', 395), ('transportation', 360), ('workers', 320), ('city', 274), ('metro', 254)]
[('walk', 893), ('walking', 878), ('one', 608), ('around', 444), ('night', 438), ('people', 428), ('go', 311), ('think', 310), ('would', 301), ('right', 280)]
[('car', 943), ('road', 697), ('people', 557), ('traffic', 419), ('cars', 341), ('get', 323), ('roads', 296), ('city', 245), ('...', 230), ('vehicles', 228)]


## Tests of Topic Modeling Visualizations <a href name id='4'>

After trying out various libraries and methods for topic visualization, I found the resulting topics very difficult to decipher. We therefore iterate on the data preparation <--> modeling step once more, this time focusing on sentiment analysis.

[The cells below can be ignored]

In [55]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

import pyLDAvis
import pyLDAvis.sklearn
pyLDAvis.enable_notebook()

# Create the TF vector representation, this only counts the terms in each document

tf_vectorizer = CountVectorizer(strip_accents = 'unicode',
                                stop_words = 'english',
                                lowercase = True,
                                token_pattern = r'\b[a-zA-Z]{3,}\b',
                                max_df = 0.5, 
                                min_df = 10)

dtm_tf = tf_vectorizer.fit_transform(cleanTransit)

tfidf_vectorizer = TfidfVectorizer(**tf_vectorizer.get_params())

dtm_tfidf = tfidf_vectorizer.fit_transform(cleanTransit)
print(dtm_tfidf.shape)

# for TFIDF DTM
lda_tfidf = LatentDirichletAllocation(n_components=10, random_state=0)
lda_tfidf.fit(dtm_tfidf)
pyLDAvis.sklearn.prepare(lda_tfidf, dtm_tfidf, tfidf_vectorizer)

(43146, 734)
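
The first number in this shape is the number of items passed to `fit_transform`. Since `cleanTransit` is a flat list of tokens, each token appears to be treated as its own one-word document, which may partly explain why the topics are hard to read. The sketch below uses only the standard library to show the difference between token-level and tweet-level documents. It assumes per-tweet token lists are kept (the notebook does not keep them), and all names and data are illustrative.

```python
from collections import Counter

def toDocuments(tokenLists):
    """Join each tweet's tokens back into one string per tweet."""
    return [' '.join(tokens) for tokens in tokenLists]

def documentFrequency(documents):
    """Count in how many documents each term occurs (the quantity min_df and max_df threshold)."""
    df = Counter()
    for doc in documents:
        df.update(set(doc.split()))  # each term counts once per document
    return df

# Hypothetical cleaned tweets, one token list per tweet
perTweetTokens = [
    ['bus', 'station', 'crowded'],
    ['bus', 'drivers', 'need', 'masks'],
    ['station', 'closed'],
]
flatTokens = [tok for tokens in perTweetTokens for tok in tokens]
tweetDocs = toDocuments(perTweetTokens)

print(len(flatTokens), 'rows if tokens are passed as documents')
print(len(tweetDocs), 'rows if tweets are passed as documents')
print(documentFrequency(tweetDocs)['bus'])  # 'bus' occurs in 2 of the 3 tweets
```

With one string per tweet, word co-occurrence within a tweet is preserved, which is the signal a topic model relies on.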




In [56]:
pyLDAvis.sklearn.prepare(lda_tfidf, dtm_tfidf, tfidf_vectorizer, mds='mmds')



In [58]:
dtm_tf = tf_vectorizer.fit_transform(cleanNonmotorized)

tfidf_vectorizer = TfidfVectorizer(**tf_vectorizer.get_params())

dtm_tfidf = tfidf_vectorizer.fit_transform(cleanNonmotorized)
print(dtm_tfidf.shape)

# for TFIDF DTM
lda_tfidf = LatentDirichletAllocation(n_components=10, random_state=0)
lda_tfidf.fit(dtm_tfidf)
pyLDAvis.sklearn.prepare(lda_tfidf, dtm_tfidf, tfidf_vectorizer, mds='mmds')

(33024, 460)




In [59]:
dtm_tf = tf_vectorizer.fit_transform(cleanMotorized)

tfidf_vectorizer = TfidfVectorizer(**tf_vectorizer.get_params())

dtm_tfidf = tfidf_vectorizer.fit_transform(cleanMotorized)
print(dtm_tfidf.shape)

# for TFIDF DTM
lda_tfidf = LatentDirichletAllocation(n_components=10, random_state=0)
lda_tfidf.fit(dtm_tfidf)
pyLDAvis.sklearn.prepare(lda_tfidf, dtm_tfidf, tfidf_vectorizer, mds='mmds')


(51572, 863)




In [67]:
corpus = list(cleanTransit) + list(cleanNonmotorized) + list(cleanMotorized)

dtm_tf = tf_vectorizer.fit_transform(corpus)

tfidf_vectorizer = TfidfVectorizer(**tf_vectorizer.get_params())

dtm_tfidf = tfidf_vectorizer.fit_transform(corpus)
print(dtm_tfidf.shape)

# for TFIDF DTM
lda_tfidf = LatentDirichletAllocation(n_components=10, random_state=0)
lda_tfidf.fit(dtm_tfidf)
pyLDAvis.sklearn.prepare(lda_tfidf, dtm_tfidf, tfidf_vectorizer, mds='mmds')


(127742, 1994)




## Conclusions + Post-Data Exploration Thoughts <a href name id='5'>
I tried out a `biterm topic modeling` library but obtained quite strange results. Based on the pyLDAvis visualizations, it did not seem likely that my corpus contained enough meaningful information to form coherent topics. This exploration led me to turn my focus toward sentiment analysis rather than topic modeling.

### [INCOMPLETE/INCOHERENT] LDA and Word Embeddings Analysis

In [88]:
import bitermplus as btm
import numpy as np
import pandas as pd

# IMPORTING DATA
texts = cleanTransit

# PREPROCESSING
# Obtaining terms frequency in a sparse matrix and corpus vocabulary
X, vocabulary, vocab_dict = btm.get_words_freqs(texts)
tf = np.array(X.sum(axis=0)).ravel()
# Vectorizing documents
docs_vec = btm.get_vectorized_docs(texts, vocabulary)
docs_lens = list(map(len, docs_vec))
# Generating biterms
biterms = btm.get_biterms(docs_vec)

# INITIALIZING AND RUNNING MODEL
model = btm.BTM(
    X, vocabulary, seed=12321, T=8, M=20, alpha=50/8, beta=0.01)
model.fit(biterms, iterations=20)
p_zd = model.transform(docs_vec)

# METRICS
perplexity = btm.perplexity(model.matrix_topics_words_, p_zd, X, 8)
coherence = btm.coherence(model.matrix_topics_words_, X, M=20)
# or
perplexity = model.perplexity_
coherence = model.coherence_

# LABELS
model.labels_
# or
btm.get_docs_top_topic(texts, model.matrix_docs_topics_)

100%|███████████████████████████████████████████████| 20/20 [00:00<00:00, 14344.40it/s]
100%|████████████████████████████████████████| 44958/44958 [00:00<00:00, 134097.89it/s]


| | documents | label |
|---|---|---|
| 0 | ’ | 0 |
| 1 | understand | 2 |
| 2 | kept | 2 |
| 3 | saying | 2 |
| 4 | public | 2 |
| ... | ... | ... |
| 44953 | must | 2 |
| 44954 | demand | 2 |
| 44955 | johnson | 2 |
| 44956 | ’ | 0 |
