 <h1> Big-Data Analytics </h1>
 <h4> Final Project: Group 25 </h4>
 

The goal of the project is to process a raw dataset from beginning to end, going through each of the steps in the data science pipeline. The ongoing coronavirus crisis has upended not only our semester but our lives, so our group could not think of a more fitting topic to examine with this project. There are now over 18,000 published papers about the new coronavirus in the database of the USA's National Library of Medicine (https://pubmed.ncbi.nlm.nih.gov/), reflecting six months of experience with the virus. However, amid the rapidly evolving threat, the most up-to-date information was often shared via social media platforms, both by front-line ER doctors sharing clinical procedures (https://www.nytimes.com/2020/03/18/well/live/coronavirus-doctors-facebook-twitter-social-media-covid.html) and by the broader scientific community and reporters sharing insights, ideas to combat the crisis, and policy proposals to limit its economic effects.

Its timeliness and reputation, as well as the company's relative openness to sharing data with developers via its API, made Twitter a logical source of data for this project.

Early in the crisis, many people the world over (though most prominently, populists with their "anti-expert" bias) claimed the media was overreacting to the crisis and sensationalizing the threat. While this interpretation has clearly been borne out as false, our group is interested in how closely the more organic, bottom-up "buzz" around the virus correlates with the actual rates of infection. Much has been made of the virus's exponential spread: do posts about the virus in affected regions follow the same pattern? How can we understand not only the patterns within countries but also the differences between countries?

The following notebook includes the code used for exploring, processing, analyzing, and visualizing the data set. The data sets themselves can be downloaded via the GitHub link for the project. The coronavirus case counts were provided in an already aggregated form and required very limited preprocessing. The Twitter data, by contrast, was unaggregated and very large, so we limited the study period to February 12th through April 10th.

Early in the semester we chose Italy as a case study, as it was not yet clear how the pandemic would play out. As the semester progressed and the pandemic continued to spread, we realized that comparing two countries would be more fruitful. With a higher number of cases and deaths than any other country in the world, the USA provides a solid point of comparison with Italy. Due to the large number of cases (and tweets) in the United States, New York state was chosen as our case study. Throughout the pandemic, New York has had the highest number of cases and deaths of any US state, and New York City has seen the worst of it (see Figure 2).

<h1> Data collection/acquisition </h1>

The following section contains the code used to query the Twitter API and obtain the data used in the project. Unlike the rest of our notebook, this code is not meant to be run: important aspects (including the API keys) are deliberately left out to streamline the code. It is meant to show the key parts of the approach taken; additional code can be provided on request.

As can be seen in the code, the hashtag strings used to query Twitter were #COVID19, #COVID-19, #covid_19, #covid, #corona, and #coronavirus (note: Twitter hashtags are not case-sensitive). We selected these tags after searching Twitter for the main hashtags people used when discussing the pandemic. If any of them was mentioned in a tweet, that tweet became part of our dataset. As the Twitter API provides a great deal of metadata, we limited the query to data points of key interest.
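
The query string is simply a place operator followed by an OR-group of hashtags. As a minimal sketch, it could be assembled from a list rather than typed by hand; the helper name `build_query` is our own illustration and not part of the Twitter API:

```python
# Hypothetical helper (not part of the Twitter API): assemble the search
# query from a place code and a list of hashtags.
HASHTAGS = ["#COVID19", "#COVID-19", "#covid_19", "#covid", "#corona", "#coronavirus"]


def build_query(place, hashtags):
    """Return a query matching tweets from `place` that contain any hashtag."""
    return "place:{} ({})".format(place, " OR ".join(hashtags))


ny_query = build_query("NY", HASHTAGS)
print(ny_query)
```

Keeping the hashtags in one list makes it harder for the two request functions to drift apart if a tag is added later.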

 ```python
import requests
import base64
import json
import time

base = 'https://api.twitter.com/1.1/tweets/search/'
#Define your keys from the developer portal
client_key = ''
client_secret = ''
#Reformat the keys and encode them
key_secret = '{}:{}'.format(client_key, client_secret).encode('ascii')

# Transform from bytes to bytes that can be printed
b64_encoded_key = base64.b64encode(key_secret)
#Transform from bytes back into Unicode
b64_encoded_key = b64_encoded_key.decode('ascii')

# Do first request manually and paste next_page here
next_page = ''


def auth():
    auth_headers = {
    'Authorization': 'Basic {}'.format(b64_encoded_key)
    }
    auth_data = {
        'grant_type': 'client_credentials'
    }
    auth_resp = requests.post('https://api.twitter.com/oauth2/token', headers=auth_headers, data=auth_data)
    token = auth_resp.json()['access_token']

    session = requests.Session()
    session.headers = {'Authorization': 'Bearer ' + token}
    return session


def get30daysNY(session):
    global next_page

    res = session.post('https://api.twitter.com/1.1/tweets/search/30day/dev.json', json={
        "query" : "place:NY (#COVID19 OR #COVID-19 OR #covid_19 OR #covid OR #corona OR #coronavirus)",
        "fromDate":"202003120000", 
        "toDate":"202004110000",
        "next": next_page
    })

    json_res = res.json()
    # The last page carries no 'next' token, so .get() falls back to None
    next_page = json_res.get('next')

    print('next_page ', next_page)

    return json_res


def getArchiveNY(session):
    global next_page
    
    res = session.post('https://api.twitter.com/1.1/tweets/search/fullarchive/dev.json', json={
        "query" : "place:NY (#COVID19 OR #COVID-19 OR #covid_19 OR #covid OR #corona OR #coronavirus)",
        "fromDate":"202002120000", 
        "toDate":"202003120000",
        "next": next_page
    })

    json_res = res.json()
    # The last page carries no 'next' token, so .get() falls back to None
    next_page = json_res.get('next')

    print('next_page ', next_page)

    return json_res


if __name__ == "__main__":
    session = auth()


    i = 2

    while True:
        # avoid rate limits
        time.sleep(5)
        # change the method to the one you want
        res = getArchiveNY(session)
        # res = get30daysNY(session)

        with open('ny-' + str(i) + '.json', 'w', encoding='utf-8') as f:
            json.dump(res, f, ensure_ascii=False, indent=2)

        i += 1

        if next_page is None:
            break
```
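
The paging logic above relies on a global `next_page` variable. A possible alternative, sketched here under the assumption that each response is a dict that contains a `next` key only when more pages exist, is a generator that follows the tokens itself. The names `iter_pages` and `fake_post` are hypothetical, and `fake_post` merely stands in for the real API call:

```python
def iter_pages(post_page, payload):
    """Yield result pages until a response has no 'next' token.

    `post_page` is any callable that takes the request payload (a dict)
    and returns the decoded JSON response (a dict).
    """
    payload = dict(payload)        # do not mutate the caller's dict
    payload.pop("next", None)      # the first request carries no token
    while True:
        page = post_page(payload)
        yield page
        token = page.get("next")
        if token is None:
            break
        payload["next"] = token


# Stand-in for the real API call, used only to show the control flow.
def fake_post(payload):
    pages = {
        None: {"results": [1, 2], "next": "t1"},
        "t1": {"results": [3], "next": "t2"},
        "t2": {"results": [4, 5]},
    }
    return pages[payload.get("next")]


pages = list(iter_pages(fake_post, {"query": "place:NY #covid"}))
total_results = sum(len(p["results"]) for p in pages)
print(len(pages), total_results)  # 3 5
```

With the session returned by `auth()`, `post_page` could be something like `lambda p: session.post(url, json=p).json()`, with a `time.sleep` added between calls to respect rate limits.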

<h1> Twitter Data preprocessing/cleaning </h1>




In [1]:
import json
import os
import pandas as pd
from datetime import datetime

def get_all_filesname(files, state):
    if state == "italy":
        files = os.listdir('Data/twitter-data/IT') 
    elif state == "ny":
        files = os.listdir('Data/twitter-data/NY')
    return files
    
def date_collector(data_set, d):
    for i in data_set['results']:
        date = date_transformer(i['created_at'][4:10])
        if date not in d:
            d[date] = 1
        else:
            d[date] +=1

    return d

def files_walker(files, d, state):
    if state == "ny":
        for i in range(len(files)):
            with open('Data/twitter-data/NY/'+files[i], 'r', encoding='utf8', errors='ignore') as f:
                region = json.load(f)
                date_collector(region, d)
    elif state =="italy":
        for i in range(len(files)):
            with open('Data/twitter-data/IT/'+files[i], 'r', encoding='utf8', errors='ignore') as f:
                region = json.load(f)
                date_collector(region, d)
    return d

def date_transformer(string_data):
    date = string_data + ' 2020'
    return datetime.strptime(date,"%b %d %Y")
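
The functions above slice the month and day out of `created_at` and append a hard-coded year. As a hedged alternative, the whole timestamp can be parsed and counted with `collections.Counter`. This sketch assumes the v1.1 timestamp layout (for example `Wed Feb 12 20:11:00 +0000 2020`), and the sample records are hand-written, not real tweets:

```python
from collections import Counter
from datetime import datetime

# Assumed layout of the 'created_at' field, e.g. "Wed Feb 12 20:11:00 +0000 2020"
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def count_tweets_per_day(pages):
    """Count tweets per calendar day across an iterable of result pages."""
    counts = Counter()
    for page in pages:
        for tweet in page.get("results", []):
            created = datetime.strptime(tweet["created_at"], CREATED_AT_FORMAT)
            counts[created.date()] += 1
    return counts


# Tiny hand-written sample, not real tweets
sample_pages = [
    {"results": [{"created_at": "Wed Feb 12 20:11:00 +0000 2020"},
                 {"created_at": "Wed Feb 12 23:59:59 +0000 2020"}]},
    {"results": [{"created_at": "Thu Feb 13 00:00:01 +0000 2020"}]},
]
daily_counts = count_tweets_per_day(sample_pages)
for day in sorted(daily_counts):
    print(day, daily_counts[day])
```

Parsing the year from the timestamp means the same code keeps working if the study period is later extended past December 2020.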


<h3> Italy </h3>

In [2]:
# dictionary for tweets per day
italia = {}

# array with name of all files in a folder
italia_json_files = []

italia_json_files = get_all_filesname(italia_json_files, 'italy')
italia = files_walker(italia_json_files, italia, "italy")
df_italy = pd.DataFrame.from_dict(italia, orient= 'index', columns=['tweets'])
df_italy = df_italy.sort_index()
df_italy[:5]

date,tweets
2020-02-12,55
2020-02-13,40
2020-02-14,33
2020-02-15,48
2020-02-16,31


<h3> New York </h3>

In [3]:
# dictionary for tweets per day
ny = {}

# array with name of all files in a folder
ny_json_files = []

ny_json_files = get_all_filesname(ny_json_files, 'ny')
ny = files_walker(ny_json_files, ny,'ny')
df_ny = pd.DataFrame.from_dict(ny, orient= 'index', columns=['tweets'])
df_ny = df_ny.sort_index()
df_ny[:5]

date,tweets
2020-02-12,16
2020-02-13,25
2020-02-14,21
2020-02-15,21
2020-02-16,20


# Exploratory Data Analysis

The next section shows the steps performed as part of our exploratory data analysis in order to grasp general patterns in both the Twitter data and the case data. Figure 1 shows that the frequency of covid-related tweets in New York and Italy moves largely in tandem, with the exception of Italy's first peak on February 24th, which reflects the fact that (known) cases in Italy began appearing 1-2 weeks before the coronavirus hit the United States.
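
The phrase "largely in tandem" is a visual impression, and one way to put a number on it would be a correlation coefficient over the dates both series share. The following is only a sketch of the mechanics: it uses the first five days of each series as printed above, which is far too few points to support any conclusion about the full study period.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# First five days of each series as printed above (Feb 12-16)
italy_head = [55, 40, 33, 48, 31]
ny_head = [16, 25, 21, 21, 20]
r_head = pearson(italy_head, ny_head)
print(round(r_head, 3))
```

On the full data, `df_ny['tweets'].corr(df_italy['tweets'])` should give the same statistic, since pandas aligns the two series on their date index before computing it.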

In [13]:
import plotly.express as px

df_italy['region'] = 'Italy'
df_italy['date'] = df_italy.index
df_ny['region'] = 'New York'
df_ny['date'] = df_ny.index
df=pd.concat([df_ny, df_italy], ignore_index=True)

fig = px.line(df, x='date', y='tweets',color='region', title='Figure 1: Covid19 related tweets per day')
fig.update_layout(plot_bgcolor='white', yaxis={'gridcolor' : 'gray'})
fig.show()

In [5]:
#COVID-19 CASE DATA IMPORT
import pandas as pd
import ipywidgets as widgets
from ipywidgets import interact, interact_manual

#USA data as saved from https://github.com/nytimes/covid-19-data 
df_usa = pd.read_csv('Data/covid-data/us-counties.csv', sep=',', index_col='date')
df_nys = df_usa[df_usa['state'].str.contains("New York")]

#italy data as saved from https://github.com/pcm-dpc/COVID-19/tree/master/dati-regioni 
df_ita = pd.read_csv('Data/covid-data/dpc-covid19-ita-regioni.csv', sep=',', index_col='data')

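# Note: the NYT file reports cumulative counts per day, so this sum over all
# dates is a quick sanity check of the columns, not a total number of cases
display(df_nys.groupby('county')[['cases','deaths']].sum().head())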


county,cases,deaths
Albany,8895,228
Allegany,406,7
Broome,1958,112
Cattaraugus,341,0
Cayuga,362,12
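
Because the NYT county file reports cumulative counts, summing a county's rows over all dates adds the same cases many times. The sketch below uses hypothetical rows (not taken from the data set) to contrast that sum with the latest cumulative value per county:

```python
# Hypothetical rows in the shape (date, county, cumulative_cases)
rows = [
    ("2020-03-01", "A", 1),
    ("2020-03-02", "A", 3),
    ("2020-03-03", "A", 6),
    ("2020-03-02", "B", 2),
    ("2020-03-03", "B", 2),
]


def latest_cumulative(rows):
    """Return the most recent cumulative count for each county."""
    latest = {}
    for _, county, cases in sorted(rows):  # ISO dates sort chronologically
        latest[county] = cases
    return latest


def summed_over_dates(rows):
    """Sum the cumulative column over all dates, as groupby(...).sum() does."""
    totals = {}
    for _, county, cases in rows:
        totals[county] = totals.get(county, 0) + cases
    return totals


print(latest_cumulative(rows))  # {'A': 6, 'B': 2}
print(summed_over_dates(rows))  # {'A': 10, 'B': 4}
```

In pandas, replacing `.sum()` with `.last()` on date-sorted data would be one way to get the per-county totals, assuming the file is ordered by date.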


In [14]:
#case data for ny (state)
df_nys = df_nys.loc[:'2020-04-10'].copy()  # copy so that adding the 'date' column does not write to a slice
df_nys['date']=df_nys.index

#case data for nyc
df_nyc = df_usa[df_usa['county'].str.contains("New York City")]
df_nyc = df_nyc.loc[:'2020-04-10']

@interact
def show_plot(statistic=['cases','deaths']):
    fig = px.line(df_nys, x='date', y=statistic,color='county', title='Figure 2: Daily Covid19 cases in New York')
    fig.update_layout(plot_bgcolor='white', yaxis={'gridcolor' : 'gray'})
    fig.show()


In [17]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import numpy as np
 
def calc_log(data):
    return data.apply(lambda x: np.log(x) if x>0 else 0)

def daily(data):
    return data-data.shift(1)

def transform(data, mode):
    if mode=='log':
        res = calc_log(data) 
    elif mode=='daily':
        res = daily(data) 
    else:
        res= data
    return res

choice =  ['all']+list(df_nys['county'].unique())
@interact
def show_plot(region=choice, numbers=['absolute','log','daily']):
    if region=='all':
        df = df_nys.groupby(df_nys.index)[['cases','deaths']].sum()
    else:
        df=df_nys[df_nys['county']==region]
    # Create figure with secondary y-axis
    fig = make_subplots(specs=[[{"secondary_y": True}]])
    # Add traces
    fig.add_trace(
        go.Scatter(x=df.index, y=transform(df['cases'],numbers), name="cases"),
        secondary_y=False,
    )
    fig.add_trace(
        go.Scatter(x=df.index, y=transform(df['deaths'],numbers), name="deaths"),
        secondary_y=True,
    )

    # Add figure title
    fig.update_layout(
        title_text='Figure 3: Cases and Deaths by Region',
        plot_bgcolor='white', 
        yaxis={'gridcolor' : 'gray'}
    )

    # Set x-axis title
    fig.update_xaxes(title_text="date")

    # Set y-axes titles
    fig.update_yaxes(title_text="cases", secondary_y=False)
    fig.update_yaxes(title_text="deaths", secondary_y=True)

    fig.show()
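
The `log` option is useful because exponential growth appears as a straight line on a log scale, and the slope of that line translates into a doubling time. As a minimal sketch (the function `doubling_time` and the synthetic series are our own illustration, not part of the analysis above):

```python
from math import log


def doubling_time(cumulative):
    """Estimate the doubling time in days from a log-linear least-squares fit.

    `cumulative` holds one strictly positive count per consecutive day
    and must be growing, otherwise the slope is zero or negative.
    """
    days = range(len(cumulative))
    logs = [log(c) for c in cumulative]
    mean_t = sum(days) / len(logs)
    mean_l = sum(logs) / len(logs)
    slope = (sum((t - mean_t) * (l - mean_l) for t, l in zip(days, logs))
             / sum((t - mean_t) ** 2 for t in days))
    return log(2) / slope


# Synthetic series that doubles every 3 days (illustration only, not case data)
synthetic = [10 * 2 ** (t / 3) for t in range(12)]
print(round(doubling_time(synthetic), 2))  # 3.0
```

Applied to a window of real case counts or tweet counts, a roughly constant estimate across windows would be consistent with exponential growth, while a rising estimate would suggest the curve is flattening.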



In [21]:
NY=df_nys.groupby(df_nys.index)[['cases','deaths']].sum()
NY = daily(NY)
NY.index = pd.to_datetime(NY.index)
df_compare = pd.concat([df_ny['tweets'],NY],axis=1,join='inner')

@interact
def show_plot(compare=['cases','deaths']):
    df = df_compare
    # Create figure with secondary y-axis
    fig = make_subplots(specs=[[{"secondary_y": True}]])
    # Add traces
    fig.add_trace(
        go.Scatter(x=df.index, y=df['tweets'], name="tweets"),
        secondary_y=False,
    )
    fig.add_trace(
        go.Scatter(x=df.index, y=df[compare], name=compare),
        secondary_y=True,
    )

    # Add figure title
    fig.update_layout(
        title_text='Figure 4: NY Comparison of tweets and ' + compare,
        plot_bgcolor='white', 
        yaxis={'gridcolor' : 'gray'}
    )

    # Set x-axis title
    fig.update_xaxes(title_text="date")

    # Set y-axes titles
    fig.update_yaxes(title_text="tweets about Covid19", secondary_y=False)
    fig.update_yaxes(title_text='daily '+ compare, secondary_y=True)

    fig.show()



# Model/algorithm building

Although our preprocessing and exploratory data analysis included Italy, as we developed our model, we realized that the most interesting question revolved around how the polarity of what Twitter users were saying corresponded with case counts: do tweets become more positive as the curve flattens? 

The selected package, TextBlob, only works on English-language data, and as no one in our group spoke enough Italian to quality-check the translation provided by the googletrans package, we had to restrict our scope to the USA. 

The code used here could be reused in future work to analyze English-language tweets in other countries; doing so was beyond the capacity of this project.

In [10]:
from textblob import TextBlob
import re
from googletrans import Translator

#sentiment analysis

def clean_tweet(tweet):
    # Raw string so that \w and \S reach the regex engine without invalid-escape warnings
    return ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", tweet).split())

def analyze_sentiment(tweet):
    analysis= TextBlob(clean_tweet(tweet))
    
    if analysis.sentiment.polarity > 0:
        return 1
    elif analysis.sentiment.polarity == 0:
        return 0
    else:
        return -1

def sentimental_date_collector(data_set, d):
    for i in data_set['results']:
        date = date_transformer(i['created_at'][4:10])
        if date not in d:
            sentiment_dict = {'positive':0, 'neutral': 0, 'negative':0 }
            d[date] = sentiment_dict
        if i['lang'] == 'en':
            tweet = analyze_sentiment(i['text'])
            if tweet == 1:
                d[date]['positive'] += 1 
            elif tweet == -1:
                d[date]['negative'] += 1
            else:
                d[date]['neutral'] += 1
            
    return d

def sentiment_walker(files, d, state):
    if state == "ny":
        for i in range(len(files)):
            with open('Data/twitter-data/NY/'+files[i], 'r', encoding='utf8', errors='ignore') as f:
                region = json.load(f)
                sentimental_date_collector(region, d)
    elif state =="italy":
        # The Italian collector (sentimental_date_collector_it) is not defined in this
        # notebook, because the sentiment analysis was restricted to English-language tweets
        raise NotImplementedError("Italian sentiment collection is not included in this notebook")
    return d
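
To make the cleaning and labelling steps above more concrete, the following sketch applies the same regular expression to two made-up tweets and shows the three-way labelling rule on its own. The `tolerance` parameter is a hypothetical extension that is not used in our analysis; with its default of 0.0 the rule matches `analyze_sentiment`.

```python
import re

# Same pattern as clean_tweet, written as a raw string
PATTERN = re.compile(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+://\S+)")


def clean(text):
    """Replace mentions, URLs and non-alphanumeric characters with spaces."""
    return " ".join(PATTERN.sub(" ", text).split())


def label(polarity, tolerance=0.0):
    """Map a polarity score in [-1, 1] to 1, 0 or -1.

    `tolerance` is a hypothetical extension: scores within +/- tolerance
    of zero count as neutral. tolerance=0.0 reproduces the rule above.
    """
    if polarity > tolerance:
        return 1
    if polarity < -tolerance:
        return -1
    return 0


print(clean("@WHO says #COVID19 cases rise: https://t.co/abc123 stay home!"))
# says COVID19 cases rise stay home
print(clean("@nyc_health update"))
# health update  (the mention pattern stops at the underscore)
```

The second example shows a limitation worth keeping in mind: handles containing an underscore leave a fragment behind, which then feeds into the polarity score.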


In [11]:
sentimental_ny = {}
ny_json_files = []

ny_json_files = get_all_filesname(ny_json_files, 'ny')
sentimental_ny = sentiment_walker(ny_json_files, sentimental_ny,'ny')
df_ny_sentimental = pd.DataFrame.from_dict(sentimental_ny, orient= 'index')
ny_sent = df_ny_sentimental.sort_index()
ny_sent[:5]

date,positive,neutral,negative
2020-02-12,6,4,3
2020-02-13,8,9,5
2020-02-14,6,6,6
2020-02-15,14,3,3
2020-02-16,11,5,2


 # Data visualization & interpretation

Figures 5 and 6 show behavior somewhat contrary to what we were expecting. The tweets showing positive polarity outnumber the negative tweets on almost every day, and outnumber the neutral tweets in almost all cases. This could reflect the limitations of TextBlob when faced with mixed sentiment (a tweet encouraging people to stay strong and stay home, for example, might yield a positive polarity), or it could truly be a finding: that the majority of people on Twitter made positive statements about coronavirus or their #lockdownLife. Considering that Twitter users tend to be better educated and wealthier than the broader public, we could also simply be seeing self-selection bias at work (Pew Research: https://www.pewresearch.org/internet/2019/04/24/sizing-up-twitter-users/). With the benefit of hindsight, we can consider that the total number of new cases did not peak until April 14th, which makes the increasing spread between positive and negative tweets shown in Figure 6 a bit of an empirical puzzle.

Further research work on the connection between social media activity and the coronavirus would be needed in order to draw firmer conclusions. The empirical setup we have established with this study serves as an exploratory analysis and framework for further work on this topic.

In [20]:
df_list= [pd.DataFrame(dict(date=ny_sent.index,tweets=ny_sent[i],sentiment=i)) for i in ny_sent.columns]

fig = px.line(pd.concat(df_list), x='date', y='tweets',color='sentiment', title='Figure 5. Corona related tweet sentiment in New York')
fig.update_layout(plot_bgcolor='white', yaxis={'gridcolor' : 'gray'})
fig.show()

df_spread=pd.DataFrame(dict(date=ny_sent.index, spread=ny_sent['positive']-ny_sent['negative']))

fig2 = px.line(df_spread, x='date', y='spread', 
               title='Figure 6: Spread between positive and negative tweets',height=300)
fig2.show()
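
The spread in Figure 6 is a difference of raw counts, so it can grow simply because more tweets are posted each day. One possible follow-up, which the figures above do not implement, is to normalize the spread by the daily tweet volume. The sketch below does this for the first five rows of `ny_sent` as printed earlier; the function name `net_sentiment_share` is our own.

```python
# First five rows of ny_sent as printed earlier: (positive, neutral, negative)
rows = {
    "2020-02-12": (6, 4, 3),
    "2020-02-13": (8, 9, 5),
    "2020-02-14": (6, 6, 6),
    "2020-02-15": (14, 3, 3),
    "2020-02-16": (11, 5, 2),
}


def net_sentiment_share(positive, neutral, negative):
    """(positive - negative) / total, or None when there are no tweets."""
    total = positive + neutral + negative
    if total == 0:
        return None
    return (positive - negative) / total


shares = {day: net_sentiment_share(*counts) for day, counts in rows.items()}
for day, share in shares.items():
    print(day, round(share, 3))
```

If the normalized share stays flat while the raw spread rises, the trend in Figure 6 would mostly reflect growing tweet volume rather than a shift in mood.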