# Text Preprocessing & Sentiment Analysis

In this notebook, we read in the datasets collected in notebook 01 and clean our text data (i.e., tweets). Cleaning consists of removing stopwords, lemmatizing, removing emails and URLs, and lowercasing all text, all done with spaCy. Next, we use VADER Sentiment to analyze our data and produce a polarity rating. We chose VADER because draft versions of our code using TextBlob seemed to perform worse and because VADER features a compound score, which is a 'normalized, weighted composite score' (quoted from the VADER documentation linked in the readme). Polarity is assigned to both raw and processed text for potential comparison.
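
To make the compound score more concrete: in the vaderSentiment source, the summed valence of a text is squashed into the range [-1, 1] with x / sqrt(x² + α), where α = 15. The sketch below reproduces only that normalization step under that assumption. `normalize_compound` is a hypothetical name used for illustration, and the real library applies its lexicon lookups and rule-based adjustments before this step.

```python
import math

def normalize_compound(valence_sum: float, alpha: float = 15.0) -> float:
    """Squash an unbounded valence sum into [-1, 1] (assumed VADER-style normalization)."""
    score = valence_sum / math.sqrt(valence_sum ** 2 + alpha)
    # Clamp defensively so the result never leaves [-1, 1]
    return max(-1.0, min(1.0, score))

# Larger sums of valence move the score toward -1 or 1 but never past them
for total in (-4.0, 0.0, 1.5, 4.0):
    print(total, round(normalize_compound(total), 4))
```

Because the curve flattens out, a tweet with many strongly positive words still scores below 1, which keeps long and short tweets roughly comparable.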

Additionally, columns are further formatted/organized here, and each dataset is exported for data visualization.

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import time
from datetime import date, timedelta
import os
import spacy
from sklearn.feature_extraction.text import CountVectorizer
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from collections import Counter



In [8]:
# Reads in dataframes from notebook 01 and concatenates subsets into master datasets
tweets_2020 = pd.concat([pd.read_csv('../data/tweets/raw_top10_cities_2020.csv'), pd.read_csv('../data/tweets/raw_bottom10_cities_2020.csv')],
                        ignore_index=True)
tweets_2019 = pd.concat([pd.read_csv('../data/tweets/raw_top10_cities_2019.csv'), pd.read_csv('../data/tweets/raw_bottom10_cities_2019.csv')],
                        ignore_index=True)
keywords_tweets = pd.concat([pd.read_csv('../data/tweets/raw_top5_cities_keywords.csv'), pd.read_csv('../data/tweets/raw_bottom5_cities_keywords.csv')],
                        ignore_index=True)

# Checks dtypes and nulls
print(tweets_2020.info())
print(tweets_2019.info())
print(keywords_tweets.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 397195 entries, 0 to 397194
Data columns (total 3 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    397195 non-null  object
 1   date    397195 non-null  object
 2   city    397195 non-null  object
dtypes: object(3)
memory usage: 9.1+ MB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 396963 entries, 0 to 396962
Data columns (total 3 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    396963 non-null  object
 1   date    396963 non-null  object
 2   city    396963 non-null  object
dtypes: object(3)
memory usage: 9.1+ MB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37658 entries, 0 to 37657
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    37658 non-null  object
 1   date    37658 non-null  object
 2   city    37658 non-null  object
dtypes: object(3)
memory usage: 

### COVID-19 Word Frequency
Here we search our 2020 tweets to see which words from a large list of COVID-19-related terms appear most often. From that list, we picked seven that we felt were most relevant to our project and re-pulled tweets from our chosen cities that contained any of those words (see notebook 01 under the "Tweets By Keywords 2020" header).

In [12]:
covid_keywords = ['coronavirus', 'koronavirus', 'corona', 'virus', 'covid', 'covid-19', 'Covid', 'Covid-19', 
                  'COVID', 'COVID-19', 'covid19', 'COVID19', 'Pandemic', 'pandemic', 'Epidemic', 'epidemic', 
                  'outbreak', 'Outbreak', 'China', 'china', 'covd', 'COVD', 'coronapocalypse', 'coronials', 
                  'Coronials', 'social', 'distancing', 'Social', 'Distancing', 'socialdistancing', 'panicbuy', 
                  'panic buy', 'panic buying', 'panicbuying', 'quarantine', 'quarantining', 'lock down', 
                  'lockdown', 'chinese', 'virus', 'chinesevirus', 'trumppandemic', 'trumpandemic', 
                  'trump pandemic', 'Trump pandemic', 'flattenthecurve', 'flatten the curve', 'quarantinelife', 
                  'quarentine', 'stayathome', 'stay at home', 'stayhome', 'stay home','StayHome', 'withme', 
                  'quarantineandchill', 'mask']

# tokenize on single spaces
biglist = [i.split(" ") for i in tweets_2020.text]
# flatten the nested lists into one list of tokens
biglist = [item for items in biglist for item in items]
# filtering for covid keywords (exact, case-sensitive match on single tokens)
counts = [word for word in biglist if word in covid_keywords]
# making a counter object from filtered list
c = Counter(counts)
# inspecting top 20
c.most_common(20)

[('quarantine', 1791),
 ('social', 1652),
 ('virus', 1123),
 ('pandemic', 954),
 ('mask', 886),
 ('COVID-19', 826),
 ('distancing', 693),
 ('lockdown', 690),
 ('coronavirus', 635),
 ('China', 575),
 ('COVID', 327),
 ('Social', 313),
 ('Covid', 286),
 ('corona', 271),
 ('covid', 248),
 ('Covid-19', 233),
 ('Pandemic', 130),
 ('COVID19', 119),
 ('covid-19', 107),
 ('outbreak', 92)]
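
The counts above treat `COVID`, `Covid`, and `covid` as separate words because matching is case-sensitive, and a token with trailing punctuation (such as `mask,`) does not match at all. A minimal sketch of one alternative, lowercasing and stripping surrounding punctuation before counting, is shown below. The sample tweets and the name `count_keywords` are made up for illustration; this is not the code that produced the counts above.

```python
from collections import Counter
import string

def count_keywords(texts, keywords):
    """Count single-word keywords case-insensitively, ignoring surrounding punctuation."""
    wanted = {k.lower() for k in keywords}
    counts = Counter()
    for text in texts:
        for raw in text.split():
            # Strip leading/trailing punctuation only, so 'covid-19' keeps its hyphen
            token = raw.strip(string.punctuation).lower()
            if token in wanted:
                counts[token] += 1
    return counts

sample = ["COVID is everywhere.", "Wear a mask, covid is real!", "#Quarantine day 10"]
result = count_keywords(sample, ["covid", "mask", "quarantine"])
print(result.most_common())
```

Merging case variants this way would give a single total per term, at the cost of losing any signal carried by capitalization.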

### Text Preprocessing

In [24]:
# Processes the text data in each dataframe
nlp = spacy.load('en_core_web_sm') # loaded once and reused for every dataframe
df_list = [tweets_2020, tweets_2019, keywords_tweets] 
for i, df in enumerate(df_list):
    print(i) # enumerate/printing here allowed us to track progress, as this code ran for many hours
    print()
    # Splits the timestamp into separate date and time columns
    timestamps = pd.to_datetime(df['date'], utc=True)
    df['time'] = timestamps.dt.time
    df['date'] = timestamps.dt.date
    # Columns are added to df in place so the changes persist in df_list; column order is set in a later cell
    # Lemmatizes and lowercases, dropping pronouns, non-alphabetic tokens, stopwords, URLs, and emails
    df['lemmata'] = [[token.lemma_.lower() for token in doc
                      if token.lemma_ != '-PRON-' and token.is_alpha and not token.is_stop
                      and not token.like_url and not token.like_email]
                     for doc in nlp.pipe(df['text'])]

0





1

2



### Sentiment Analysis

In [None]:
# Uses Vader to analyze both raw tweets and cleaned tweets
analyzer = SentimentIntensityAnalyzer() # created once and reused rather than rebuilt for every tweet
for df in df_list:
    df['text_polarity'] = df['text'].apply(lambda text: analyzer.polarity_scores(text)['compound'])
    # polarity_scores expects a string, so each list of lemmata is joined back into one
    df['lemmata_polarity'] = df['lemmata'].apply(lambda tokens: analyzer.polarity_scores(' '.join(tokens))['compound'])

In [None]:
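# Creates top_10 flag for visualization notebook
top_10 = ['New York City', 'Washington, D.C.', 'Anchorage',
       'Honolulu', 'Newark', 'Providence', 'Seattle', 'Boston',
       'Manchester', 'Charleston']
cols = ['city', 'top_10', 'date', 'time', 'text', 'text_polarity', 'lemmata', 'lemmata_polarity']
for i, df in enumerate(df_list):
    df['top_10'] = np.where(df['city'].isin(top_10), 1, 0)
    df_list[i] = df[cols] # writes the reordered frame back; df = df[cols] would only rebind the loop variable
# Points the original names at the reordered frames
tweets_2020, tweets_2019, keywords_tweets = df_list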
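
A note on reordering columns inside a loop: selecting columns returns a new dataframe, so the result only sticks if it is written back to the list (or to the original name). The toy frames below are made up for illustration and show the difference between rebinding the loop variable and assigning by index.

```python
import pandas as pd

frames = [pd.DataFrame({"b": [1], "a": [2]}), pd.DataFrame({"b": [3], "a": [4]})]

# Rebinding the loop variable leaves the frames in the list untouched
for df in frames:
    df = df[["a", "b"]]
before = [list(f.columns) for f in frames]
print(before)

# Writing back by index replaces each element with the reordered frame
for i, df in enumerate(frames):
    frames[i] = df[["a", "b"]]
after = [list(f.columns) for f in frames]
print(after)
```

Adding a column with `df['new'] = ...` is different: it modifies the existing frame in place, so it persists without writing back.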

### Remove Duplicates

In [4]:
# Checks the difference in data after duplicates are removed
print(tweets_2019.shape)
print(tweets_2020.shape)
print(keywords_tweets.shape)
print()
print(tweets_2019.drop_duplicates(subset='text').shape)
print(tweets_2020.drop_duplicates(subset='text').shape)
print(keywords_tweets.drop_duplicates(subset='text').shape)

(396963, 8)
(397195, 8)
(37658, 8)

(368761, 8)
(367369, 8)
(23803, 8)
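
For reference, `drop_duplicates(subset='text')` compares rows on the `text` column only and keeps the first row for each distinct tweet, so a repeated tweet is dropped even when its city or date differs. A toy illustration with made-up rows:

```python
import pandas as pd

toy = pd.DataFrame({
    "text": ["stay home", "stay home", "wear a mask"],
    "city": ["Boston", "Seattle", "Boston"],
})

# keep="first" is the pandas default: the Seattle copy of "stay home" is dropped
deduped = toy.drop_duplicates(subset="text")
print(toy.shape, deduped.shape)
print(deduped["city"].tolist())
```

If identical text posted from different cities should count separately, one option would be to pass `subset=['text', 'city']` instead.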


In [5]:
# Drops duplicate tweets and saves cleaned tweets w/sentiment analysis as csv for data viz
tweets_2019.drop_duplicates(subset='text').to_csv('../data/tweets/tweets_2019_polarity.csv', index=False)
tweets_2020.drop_duplicates(subset='text').to_csv('../data/tweets/tweets_2020_polarity.csv', index=False)
keywords_tweets.drop_duplicates(subset='text').to_csv('../data/tweets/tweets_keywords_polarity.csv', index=False)