<a href="https://colab.research.google.com/github/jacobfa/COVIDAnalysis/blob/master/GoogleSearchCOVID.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1><center>A Study of the Relationship of Precursory Symptoms of COVID-19 to Positive Test Results</center></h1>
<h6><center>Jacob Fein-Ashley</center></h6>

   This study compares positive COVID-19 test counts in the US with Google search trends for precursory COVID-19 symptoms. Machine learning techniques such as classification and regression will be used to relate state-level Google search trends to precursory symptoms of COVID-19. At the end of the study, positive test results will be compared statistically with the symptom search data to see whether the two are correlated.

### Initialization of Data and Importing Data Analytic Dependencies

#### Importing Data Analytic and Dashboard Dependencies

In [1]:
!pip install chart_studio
!pip install plotly --upgrade
!pip install pytrends


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from plotly import __version__
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import chart_studio.plotly as py
import plotly.express as px
import plotly.graph_objects as go
from datetime import date
from IPython.display import display, HTML, IFrame
import requests
from requests.auth import HTTPBasicAuth
from ipywidgets import widgets
import sklearn
from sklearn.model_selection import train_test_split
from sklearn import model_selection
from sklearn import preprocessing
from sklearn import linear_model
from sklearn import svm
from sklearn import metrics

def configure_plotly_browser_state():
  import IPython
  display(IPython.core.display.HTML('''
        <script src="/static/components/requirejs/require.js"></script>
        <script>
          requirejs.config({
            paths: {
              base: '/static/base',
              plotly: 'https://cdn.plot.ly/plotly-latest.min.js?noext',
            },
          });
        </script>
        '''))




### Google Searches
#### PyTrends (Google Trends API)
Using PyTrends, an unofficial Python client for Google Trends, search interest for five common precursory COVID symptoms is retrieved and studied for its correlation with COVID case statistics.

In [2]:
# Import the pytrends client for Google Trends
from pytrends.request import TrendReq
pytrend = TrendReq()
# Build the dataframe of several common COVID precursor symptoms
#---Symptoms List---#
pytrend.build_payload(kw_list=['anosmia','diarrhea','cough', 'dry cough', 'fever'])
interest_df = pytrend.interest_over_time()
# Round-trip through CSV so that the date index is reloaded as an ordinary column
interest_df.to_csv('keyword_data.csv')
interest_df = pd.read_csv('keyword_data.csv')
# Erase the contents of keyword_data.csv now that the data is in memory
with open('keyword_data.csv', 'w'):
    pass
interest_df.head()
#------------------#

Unnamed: 0,date,anosmia,diarrhea,cough,dry cough,fever,isPartial
0,2015-10-18,0,15,21,1,35,False
1,2015-10-25,0,15,22,2,35,False
2,2015-11-01,0,14,22,2,37,False
3,2015-11-08,0,15,22,2,35,False
4,2015-11-15,0,15,24,2,36,False
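
The CSV round trip in the cell above mainly serves to turn the date index into a regular column. As a hedged alternative, the sketch below does the same in memory with `reset_index()`. The frame `sample_interest` is a hypothetical stand-in with made-up values for what `interest_over_time()` returns, so no network call is needed.

```python
import pandas as pd

# Hypothetical stand-in for pytrend.interest_over_time(): a frame indexed by date.
# All values are made up for illustration.
dates = pd.date_range("2020-03-01", periods=4, freq="D", name="date")
sample_interest = pd.DataFrame(
    {"cough": [21, 40, 100, 63], "fever": [35, 52, 98, 70], "isPartial": [False] * 4},
    index=dates,
)

# reset_index() promotes the named date index to a column without touching disk.
sample_flat = sample_interest.reset_index()
print(sample_flat.columns.tolist())
```

One difference from the round trip: the `date` column stays a datetime type here, whereas `pd.read_csv` returns it as strings unless `parse_dates` is set.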


#### Visualization of Trends
The trend data has been loaded into a pandas dataframe. Next, a line chart plots the relative search interest reported by Google Trends for each symptom.

In [3]:
datasets = ['anosmia','diarrhea','cough', 'dry cough', 'fever']
fig4 = px.line(interest_df, x="date", y=datasets, title='Relative Interest of COVID Related Symptoms by Date',range_x=['2020-01-01', date.today()])
fig4.update_layout(
    #Centering the title
    title=dict(x=0.5),
    xaxis_title="Date",
    yaxis_title="Interest (Relative to 100)",
    legend_title="Symptom",
    font=dict(
        size=14,
    )
)
#plot_url = py.plot(fig4, filename = 'Figure4')
fig4.show()

<h6><center>Figure 1</center></h6>

Google Trends reports search interest on a scale from zero to one hundred, independent of absolute search volume and population. From a simple view of the five symptoms related to COVID (anosmia, diarrhea, cough, dry cough, fever), the chart suggests a correlation between Google search interest in these symptoms and the spread of COVID. When COVID cases spiked in March 2020, there is a clear, sharp rise in interest for these searches.
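
A visual impression of correlation can be checked numerically. The sketch below is a minimal illustration on synthetic data, not the study's data: it computes the Pearson correlation between a search-interest series and a case series at several lags, to show how one might test whether searches lead cases. The series, the two-week delay, and the helper `lagged_correlation` are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Synthetic weekly series (not real data): cases trail searches by two weeks.
weeks = pd.date_range("2020-01-05", periods=30, freq="7D")
rng = np.random.default_rng(0)
searches = pd.Series(50 + 40 * np.sin(np.arange(30) / 4) + rng.normal(0, 2, 30), index=weeks)
cases = searches.shift(2) * 1000  # hypothetical relationship, for illustration only


def lagged_correlation(leading, trailing, max_lag):
    """Pearson correlation of `leading` shifted forward by k steps against `trailing`, for k = 0..max_lag."""
    return pd.Series({lag: leading.shift(lag).corr(trailing) for lag in range(max_lag + 1)})


corr_by_lag = lagged_correlation(searches, cases, max_lag=4)
best_lag = corr_by_lag.idxmax()
print(corr_by_lag.round(3))
print("Lag with the highest correlation:", best_lag)
```

With real data, the lag that maximizes the correlation gives a rough estimate of how far search interest runs ahead of reported cases, although correlation alone does not establish that one drives the other.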

#### Continuation
An analysis of just a few precursory COVID symptoms suggests a correlation between active COVID cases and what users are searching for on Google. Next, we will create a heat map of total COVID-related Google searches and compare it to the total COVID cases in each state.

In [4]:
#Import the Google data of total interest
df_google_data = pd.read_csv("https://raw.githubusercontent.com/google-research/open-covid-19-data/master/data/exports/search_trends_symptoms_dataset/United%20States%20of%20America/2020_US_daily_symptoms_dataset.csv")

#### Data Cleaning
A dictionary mapping US state and territory names to their two-letter codes is defined so that the analysis and mapping tools can match each region correctly.

In [5]:
#dictionary for state codes
state_codes = {
    'District of Columbia' : 'DC','Mississippi': 'MS', 'Oklahoma': 'OK', 
    'Delaware': 'DE', 'Minnesota': 'MN', 'Illinois': 'IL', 'Arkansas': 'AR', 
    'New Mexico': 'NM', 'Indiana': 'IN', 'Maryland': 'MD', 'Louisiana': 'LA', 
    'Idaho': 'ID', 'Wyoming': 'WY', 'Tennessee': 'TN', 'Arizona': 'AZ', 
    'Iowa': 'IA', 'Michigan': 'MI', 'Kansas': 'KS', 'Utah': 'UT', 
    'Virginia': 'VA', 'Oregon': 'OR', 'Connecticut': 'CT', 'Montana': 'MT', 
    'California': 'CA', 'Massachusetts': 'MA', 'West Virginia': 'WV', 
    'South Carolina': 'SC', 'New Hampshire': 'NH', 'Wisconsin': 'WI',
    'Vermont': 'VT', 'Georgia': 'GA', 'North Dakota': 'ND', 
    'Pennsylvania': 'PA', 'Florida': 'FL', 'Alaska': 'AK', 'Kentucky': 'KY', 
    'Hawaii': 'HI', 'Nebraska': 'NE', 'Missouri': 'MO', 'Ohio': 'OH', 
    'Alabama': 'AL', 'Rhode Island': 'RI', 'South Dakota': 'SD', 
    'Colorado': 'CO', 'New Jersey': 'NJ', 'Washington': 'WA', 
    'North Carolina': 'NC', 'New York': 'NY', 'Texas': 'TX', 
    'Nevada': 'NV', 'Maine': 'ME', 'Guam' : 'GU', 'Northern Mariana Islands' : 'MP',
    'Puerto Rico' : 'PR', 'Virgin Islands' : 'VI'}
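
A direct dictionary lookup raises `KeyError` for any region name that is missing from the table. As a hedged sketch, the example below uses `Series.map`, which returns `NaN` for unknown names, and lists the names that failed to map. The small `sample_codes` dictionary and the `regions` series are hypothetical and only illustrate the pattern.

```python
import pandas as pd

# Small hypothetical subset of the lookup table, for illustration only.
sample_codes = {"Alabama": "AL", "Alaska": "AK", "District of Columbia": "DC"}
regions = pd.Series(["Alabama", "District of Columbia", "American Samoa", None])

# .map() yields NaN for names missing from the dictionary instead of raising KeyError.
mapped = regions.map(sample_codes)

# Names that were present in the data but absent from the lookup table.
unmapped = regions[mapped.isna() & regions.notna()].tolist()
print(unmapped)
```

Inspecting `unmapped` before dropping rows makes it clear which regions are being excluded and why.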

#### Further Cleaning
The data contains many null values, which will be set to 0. The data will then be reduced to three columns: date collected, state code, and total relevance. Remember that Google Trends scores the relevance of a subject from 0 to 100, and there are 422 symptom columns to sum for each row.

In [6]:
df_google_data = df_google_data.fillna(0)
#Creating a column that sums the total relevance of the 422 symptom columns (the last 422 columns).
#This is computed before 'state' is appended so that the slice covers all 422 symptoms.
df_google_data['relevance'] = df_google_data.iloc[:, -422:].sum(axis=1)
#Inputting proper state format
df_google_data['state'] = df_google_data['sub_region_1'].apply(lambda x : state_codes[x])
df_google_data = df_google_data[['date','state','relevance']]
#to_csv overwrites any existing file, so no separate truncation step is needed
df_google_data.to_csv("google_search_data.csv")
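
The positional slice used for the relevance sum depends on the exact column order of the downloaded file. A possibly more robust variant, sketched below on a toy frame, selects the symptom columns by name. The `symptom:` prefix is an assumption about how the dataset labels its columns and should be checked against the actual header before use.

```python
import pandas as pd

# Toy frame with made-up values; the "symptom:" prefix is an assumed naming convention.
toy = pd.DataFrame({
    "date": ["2020-03-01", "2020-03-02"],
    "sub_region_1": ["Alabama", "Alaska"],
    "symptom:Cough": [10.0, None],
    "symptom:Fever": [5.0, 7.5],
})

# Pick symptom columns by name so that added or reordered columns do not shift the sum.
symptom_cols = [col for col in toy.columns if col.startswith("symptom:")]
toy["relevance"] = toy[symptom_cols].fillna(0).sum(axis=1)
print(toy[["date", "relevance"]])
```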

#### Data Visualization
Now that the data from the beginning of 2020 has been imported, the dashboard below shows the search interest for all COVID-related symptoms over the year so far.

In [7]:
%%HTML
<div class='tableauPlaceholder' id='viz1602450440539' style='position: relative'><noscript><a href='#'><img alt=' ' src='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;CO&#47;COVIDDash1&#47;RelevanceDashboard&#47;1_rss.png' style='border: none' /></a></noscript><object class='tableauViz'  style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='' /><param name='name' value='COVIDDash1&#47;RelevanceDashboard' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;CO&#47;COVIDDash1&#47;RelevanceDashboard&#47;1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /><param name='language' value='en' /><param name='filter' value='publish=yes' /></object></div>                <script type='text/javascript'>                    var divElement = document.getElementById('viz1602450440539');                    var vizElement = divElement.getElementsByTagName('object')[0];                    if ( divElement.offsetWidth > 800 ) { vizElement.style.width='1000px';vizElement.style.height='827px';} else if ( divElement.offsetWidth > 500 ) { vizElement.style.width='1000px';vizElement.style.height='827px';} else { vizElement.style.width='100%';vizElement.style.height='777px';}                     var scriptElement = document.createElement('script');                    scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>

<h6><center>Figure 2</center></h6>

#### Total Search Trends
Finally, a visualization of total search trends over 2020.

In [8]:
%%HTML
<div class='tableauPlaceholder' id='viz1602449620662' style='position: relative'><noscript><a href='#'><img alt=' ' src='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;CO&#47;COVIDDash1&#47;TotalRelevance&#47;1_rss.png' style='border: none' /></a></noscript><object class='tableauViz'  style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='' /><param name='name' value='COVIDDash1&#47;TotalRelevance' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;CO&#47;COVIDDash1&#47;TotalRelevance&#47;1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /><param name='language' value='en' /><param name='filter' value='publish=yes' /></object></div>                <script type='text/javascript'>                    var divElement = document.getElementById('viz1602449620662');                    var vizElement = divElement.getElementsByTagName('object')[0];                    vizElement.style.width='100%';vizElement.style.height=(divElement.offsetWidth*0.75)+'px';                    var scriptElement = document.createElement('script');                    scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>

<h6><center>Figure 3</center></h6>

### Predictive Modeling
Now that the visualizations and the datasets of Google search data are in place, a predictive model will be created to predict the direction of COVID in individual states and in overall trends.

In [9]:
# Changing the mode of the column to datetime format to work with scikit tools easier
#df_google_data['date'] = pd.to_datetime(df_google_data['date'])

#### Training and Test Data
The data will be split into training and test sets. The training data will include all Google search data leading up to the most recent week. The test data will be the Google search values for the current or prior week. The model will be fit on the training data and then evaluated on the test data to confirm its validity and fit.

In [10]:
# Creating training and test values for the dates and relevance.
X = df_google_data['date'].values
Y = df_google_data['relevance'].values
# Split X and Y in a single call so that each date stays paired with its relevance value
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
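
The text describes holding out the most recent week as test data, whereas `train_test_split` shuffles rows at random. A minimal sketch of a chronological split is shown below. The frame `daily` is synthetic and only stands in for `df_google_data`; the seven-day cutoff is one reading of "the current or prior week".

```python
import pandas as pd

# Synthetic daily frame standing in for df_google_data (values are made up).
daily = pd.DataFrame({
    "date": pd.date_range("2020-09-01", periods=28, freq="D"),
    "relevance": range(28),
})

# Hold out the final 7 days as the test set; everything earlier is training data.
cutoff = daily["date"].max() - pd.Timedelta(days=7)
train = daily[daily["date"] <= cutoff]
test = daily[daily["date"] > cutoff]
print(len(train), "training rows;", len(test), "test rows")
```

Splitting by date keeps future observations out of the training set, which matters when the goal is to forecast.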

#### Stochastic Gradient Descent
For the first model, a stochastic gradient descent (SGD) regressor will be fit to the training data, and the model's predicted relevance over time will be compared with the following week's data. In other words, the SGD model will be used to forecast. Stochastic gradient descent visits the training data points one at a time, measures the difference between the linear model's prediction and the observed value, and adjusts the model weights to reduce that difference. The objective minimized by the SGD model is given below, where $Q(w)$ is the sum of squared errors for the weights $w$, $\hat{y}_i$ are the model's predictions, $y_i$ are the observed data points, and $n$ is the number of data points.
$$Q(w) = \sum_{i=1}^{n} (\hat{y}_i - y_i)^2$$

In [11]:
# Squared error is the default loss; the "squared_loss" name was removed in newer scikit-learn releases
mdl = linear_model.SGDRegressor(penalty="l2")
#mdl.fit(X_train, Y_train)
#Y_test_hat = mdl.predict(X_test)
#test_out = pd.DataFrame([Y_test_hat, Y_test], index=["Prediction", "Actual"]).transpose()
#val_fig = px.scatter(test_out, x="Prediction", y="Actual", title="Stochastic Gradient Descent model prediction of Total Search Relevance vs Date")

### COVID Statistics Dashboard

#### Data Initialization

In [12]:
#Start the dashboard using plotly
init_notebook_mode(connected=True)
#Import the live state-level dataset
df_states_raw = pd.read_csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/live/us-states.csv")
#The "us-states.csv" file contains many columns that are not useful in our study.
df_total_cases = df_states_raw[['state','cases']].copy()
#States are labeled by full name, so convert them to state codes (unlisted regions become NaN).
df_total_cases['state'] = df_total_cases['state'].map(state_codes)
#Exclude null values from the dataset
df_total_cases = df_total_cases.dropna()
#Display dataset
df_total_cases.head()

Unnamed: 0,state,cases
0,AL,165342
1,AK,10608
2,AZ,225575
3,AR,92833
4,CA,857002


#### Interactive Dashboards

In [13]:
configure_plotly_browser_state()
#Create an interactive choropleth map
fig1 = go.Figure(data=go.Choropleth(
    locations=df_total_cases['state'], # Spatial coordinates
    z = df_total_cases['cases'].astype(float), # Data to be color-coded
    locationmode = 'USA-states', # set of locations match entries in `locations`
    colorscale = 'purpor',
    colorbar_title = "Positive Cases",
    marker_line_color='white', # line markers between states
))

fig1.update_layout(
    title_text = f"Total COVID-19 Cases by State ({date.today()})",
    geo_scope='usa', # limit map scope to USA
    #Centering the title
    title=dict(x=0.5)
)
#plot_url = py.plot(fig1, filename = 'Figure1')
fig1.show()

#### Total Cases

In [14]:
configure_plotly_browser_state()
#Importing data from the public GitHub repository published by The New York Times.
#A separate variable name keeps the state-level df_total_cases frame intact.
df_us_cases = pd.read_csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us.csv")
fig2 = px.line(df_us_cases, x="date", y="cases", title='Total Positive COVID Tests in the US by Date')
fig2.update_layout(
    #Centering the title
    title=dict(x=0.5),
    xaxis_title="Date",
    yaxis_title="Positive COVID Tests",
    font=dict(
        family="Times New Roman",
        size=14,
    )
)
#plot_url = py.plot(fig2, filename = 'Figure2')
fig2.show()

#### Daily Cases

In [15]:
configure_plotly_browser_state()
#creating a data frame for daily cases in the USA
df_daily_cases = pd.read_csv("https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/ecdc/new_cases.csv")
#only want the united states data
df_daily_cases = df_daily_cases[['date','United States']]
df_daily_cases.head()
fig3 = go.Figure()
fig3.add_trace(go.Bar(
    x=df_daily_cases['date'],
    y=df_daily_cases['United States'],
    name='Daily US COVID Cases',
    marker_color='indianred'
))
#Align the dates and turn cases into bars
fig3.update_layout(barmode='group', xaxis_tickangle=-45)
#Align the graph title and axes
fig3.update_layout(
    xaxis_title="Date",
    yaxis_title="Cases",
    title={
        'text': f"Daily US COVID Cases ({date.today()})",
        'y':0.9,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})

#plot_url = py.plot(fig3, filename = 'Figure3')
fig3.show()