<h1><center>A Study of the Relationship of Precursory Symptoms of COVID-19 to Positive Test Results</center></h1>
<h6><center>Jacob Fein-Ashley</center></h6>

   This study compares confirmed positive COVID-19 test counts in the US with trending Google searches for precursory COVID-19 symptoms. Machine learning techniques such as classification and regression will be used to relate Google search trends by state to precursory symptoms of COVID-19. At the end of the study, positive test results will be compared statistically with the unsupervised categorization of precursory symptoms to determine whether the two are correlated.

### Initialization of Data and Importing Data Analytic Dependencies

#### Importing Data Analytic and Dashboard Dependencies

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from plotly import __version__
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import chart_studio.plotly as py
import plotly.express as px
import plotly.graph_objs as go
from datetime import date
from IPython.display import display, HTML, IFrame
import requests
from requests.auth import HTTPBasicAuth

#### Importing a Dictionary of State Codes

In [2]:
#dictionary for state codes
state_codes = {
    'District of Columbia' : 'DC','Mississippi': 'MS', 'Oklahoma': 'OK',
    'Delaware': 'DE', 'Minnesota': 'MN', 'Illinois': 'IL', 'Arkansas': 'AR', 
    'New Mexico': 'NM', 'Indiana': 'IN', 'Maryland': 'MD', 'Louisiana': 'LA', 
    'Idaho': 'ID', 'Wyoming': 'WY', 'Tennessee': 'TN', 'Arizona': 'AZ', 
    'Iowa': 'IA', 'Michigan': 'MI', 'Kansas': 'KS', 'Utah': 'UT', 
    'Virginia': 'VA', 'Oregon': 'OR', 'Connecticut': 'CT', 'Montana': 'MT', 
    'California': 'CA', 'Massachusetts': 'MA', 'West Virginia': 'WV', 
    'South Carolina': 'SC', 'New Hampshire': 'NH', 'Wisconsin': 'WI',
    'Vermont': 'VT', 'Georgia': 'GA', 'North Dakota': 'ND', 
    'Pennsylvania': 'PA', 'Florida': 'FL', 'Alaska': 'AK', 'Kentucky': 'KY', 
    'Hawaii': 'HI', 'Nebraska': 'NE', 'Missouri': 'MO', 'Ohio': 'OH', 
    'Alabama': 'AL', 'Rhode Island': 'RI', 'South Dakota': 'SD', 
    'Colorado': 'CO', 'New Jersey': 'NJ', 'Washington': 'WA', 
    'North Carolina': 'NC', 'New York': 'NY', 'Texas': 'TX', 
    'Nevada': 'NV', 'Maine': 'ME', 'Guam' : 'GU', 'Northern Mariana Islands' : 'MP',
    'Puerto Rico' : 'PR', 'Virgin Islands' : 'VI'}

#### Data Initialization
The data is loaded, cleaned, and reduced to the columns the study needs. State names are converted to two-letter state codes to follow the convention Plotly expects.

In [3]:
#Start the dashboard using plotly
init_notebook_mode(connected=True)
#Import daily dataset and total datasets
df_garbage = pd.read_csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/live/us-states.csv")
#The "us-states.csv" file contains many columns that are not useful in our study.
df_total_cases = df_garbage[['state','cases']].copy()
#States are labeled by full name, so convert them to two-letter state codes.
#Names missing from state_codes become NaN instead of raising a KeyError.
df_total_cases['state'] = df_total_cases['state'].map(state_codes)
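
Any state or territory name that is absent from `state_codes` will either break or blank out the lookup step, so it can help to list the unmapped names before plotting. The following is a minimal sketch on made-up data; the small dictionary, the sample rows, and the names `unmapped` and `df_mapped` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Hypothetical miniature versions of the notebook's objects, for illustration only.
state_codes = {'Alabama': 'AL', 'Alaska': 'AK'}
df = pd.DataFrame({'state': ['Alabama', 'Alaska', 'American Samoa'],
                   'cases': [10, 5, 1]})

# Series.map returns NaN for keys that are absent from the dictionary.
codes = df['state'].map(state_codes)

# Names that could not be converted to a two-letter code.
unmapped = sorted(df.loc[codes.isna(), 'state'].unique())

# Keep only the rows with a known code.
df_mapped = df.assign(state=codes).dropna(subset=['state'])

print(unmapped)
print(df_mapped)
```

Inspecting `unmapped` shows whether the dictionary needs another entry or whether those rows can safely be dropped.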

#### Displaying a Sample of the Dataset

In [4]:
#Exclude null values from the dataset
df_total_cases = df_total_cases.dropna()
#Display dataset
df_total_cases.head()

| | state | cases |
|---|---|---|
| 0 | AL | 159713 |
| 1 | AK | 9300 |
| 2 | AZ | 221080 |
| 3 | AR | 87013 |
| 4 | CA | 834716 |


### Data Visualization
Now that the case data has been imported, cleaned, and examined, we will visualize it using the Python `plotly` library.

In [5]:
#Create an interactive choropleth map
fig1 = go.Figure(data=go.Choropleth(
    locations=df_total_cases['state'], # Spatial coordinates
    z = df_total_cases['cases'].astype(float), # Data to be color-coded
    locationmode = 'USA-states', # set of locations match entries in `locations`
    colorscale = 'purpor',
    colorbar_title = "Positive Cases",
    marker_line_color='white', # line markers between states
))

fig1.update_layout(
    #Setting and centering the title
    title=dict(text=f"Total COVID-19 Cases by State ({date.today()})", x=0.5),
    geo_scope='usa', # limit map scope to USA
)
fig1.show()

<h6><center>Figure 1</center></h6>

#### Total Cases
   Now that an interactive map of total cases by state has been created, a line chart of cumulative US cases by date will be created.

In [6]:
#Importing national data from the New York Times' public GitHub repository.
#A separate name keeps the state-level dataframe above intact.
df_us_cases = pd.read_csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us.csv")
fig2 = px.line(df_us_cases, x="date", y="cases", title='Total Positive COVID Tests in the US by Date')
fig2.update_layout(
    #Centering the title
    title=dict(x=0.5),
    xaxis_title="Date",
    yaxis_title="Positive COVID Tests",
    font=dict(
        family="Times New Roman",
        size=14,
    )
)
fig2.show()

<h6><center>Figure 2</center></h6>
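
The series in Figure 2 is cumulative, so daily counts can be recovered from it by differencing consecutive rows. The following is a minimal sketch on synthetic numbers; the frame `df_cumulative` and the column `new_cases` are assumptions made for illustration, and clipping negative values to zero is one possible way to handle downward revisions, not the notebook's method.

```python
import pandas as pd

# Synthetic cumulative counts standing in for the `cases` column of us.csv.
df_cumulative = pd.DataFrame({
    'date': pd.to_datetime(['2020-03-01', '2020-03-02', '2020-03-03', '2020-03-04']),
    'cases': [10, 15, 15, 27],
})

# diff() subtracts the previous row; the first row has no predecessor,
# so it is filled with the first cumulative value.
daily = df_cumulative['cases'].diff().fillna(df_cumulative['cases'].iloc[0])

# A downward revision of the cumulative total would appear as a negative value;
# here such values are clipped to zero.
df_cumulative['new_cases'] = daily.clip(lower=0).astype(int)

print(df_cumulative)
```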

#### Creating a Chart of Daily Cases

In [7]:
#creating a data frame for daily cases in the USA
df_daily_cases = pd.read_csv("https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/ecdc/new_cases.csv")
#only want the united states data
df_daily_cases = df_daily_cases[['date','United States']]
df_daily_cases.head()
fig3 = go.Figure()
fig3.add_trace(go.Bar(
    x=df_daily_cases['date'],
    y=df_daily_cases['United States'],
    name='Daily US COVID Cases',
    marker_color='indianred'
))
#Align the dates and turn cases into bars
fig3.update_layout(barmode='group', xaxis_tickangle=-45)
#Align the graph title and axes
fig3.update_layout(
    xaxis_title="Date",
    yaxis_title="Cases",
    title={
        'text': f"Daily US COVID Cases ({date.today()})",
        'y':0.9,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})

fig3.show()

<h6><center>Figure 3</center></h6>
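
Daily case counts and weekly search data have to share a time grid before they can be compared. The following is a minimal sketch that sums daily counts into Sunday-to-Saturday weeks labeled by their starting Sunday; the synthetic frame `daily` and the column name `week_start` are assumptions for illustration, and the week boundary should be adjusted to whatever the search dataset actually uses.

```python
import pandas as pd

# Synthetic daily counts: 14 days starting on Sunday 2020-03-01, one case per day.
daily = pd.DataFrame({
    'date': pd.date_range('2020-03-01', periods=14, freq='D'),
    'cases': [1] * 14,
})

# 'W-SAT' periods end on Saturday, so each period runs Sunday through Saturday;
# start_time gives the Sunday that opens the week.
week_start = daily['date'].dt.to_period('W-SAT').dt.start_time

# Sum the daily counts within each week.
weekly = (daily.groupby(week_start)['cases']
               .sum()
               .rename_axis('week_start')
               .reset_index())

print(weekly)
```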

### PyTrends (Google Trends API)
   Now that the visualizations of total COVID cases are complete, the relationship between confirmed case data and precursory symptoms will be analyzed using pytrends, a Python library for querying Google Trends. Several commonly reported COVID symptoms are chosen as search terms, and the search interest in each is plotted in a line chart below.

In [8]:
#Importing the pytrends library                     
from pytrends.request import TrendReq
pytrend = TrendReq()
# Building the dataframe of several common COVID precursor symptoms.
#---Symptoms List---#
pytrend.build_payload(kw_list=['anosmia','diarrhea','cough', 'dry cough', 'fever'])
#interest_over_time() is indexed by date; reset_index() moves the date
#into a regular column so it can be used as the x-axis when plotting.
interest_df = pytrend.interest_over_time().reset_index()
interest_df.head()
#------------------#

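| | date | anosmia | diarrhea | cough | dry cough | fever | isPartial |
|---|---|---|---|---|---|---|---|
| 0 | 2015-10-11 | 0 | 15 | 21 | 2 | 36 | False |
| 1 | 2015-10-18 | 0 | 15 | 22 | 2 | 35 | False |
| 2 | 2015-10-25 | 0 | 15 | 22 | 2 | 35 | False |
| 3 | 2015-11-01 | 0 | 15 | 22 | 2 | 36 | False |
| 4 | 2015-11-08 | 0 | 15 | 23 | 2 | 35 | False |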


#### Visualization of Trends
The search trends have been loaded into a pandas dataframe. Next, a line chart is built from the relative interest index that Google Trends reports for each search term.

In [9]:
datasets = ['anosmia','diarrhea','cough', 'dry cough', 'fever']
fig4 = px.line(interest_df, x="date", y=datasets, title='Relative Interest of COVID Related Symptoms by Date',range_x=['2020-01-01', date.today()])
fig4.update_layout(
    #Centering the title
    title=dict(x=0.5),
    xaxis_title="Date",
    yaxis_title="Interest (Relative to 100)",
    legend_title="Search Term",
    font=dict(
        size=14,
    )
)
fig4.show()

<h6><center>Figure 4</center></h6>
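
Because the symptoms are described as precursory, search interest would be expected to lead case counts by some number of weeks. The following is a minimal sketch of a lagged Pearson correlation on synthetic weekly data; the helper `lagged_correlation`, the parameter `max_lag`, and the two series are hypothetical and are not the study's actual statistical method.

```python
import pandas as pd

def lagged_correlation(interest, cases, max_lag=4):
    """Pearson r between interest shifted forward by k periods and cases, for k = 0..max_lag."""
    return pd.Series(
        {lag: interest.shift(lag).corr(cases) for lag in range(max_lag + 1)},
        name='pearson_r',
    )

# Synthetic weekly series: cases repeat the interest pattern two weeks later.
weeks = pd.date_range('2020-03-01', periods=12, freq='W-SUN')
interest = pd.Series([0, 1, 4, 9, 7, 3, 2, 6, 8, 5, 1, 0], index=weeks, dtype=float)
cases = interest.shift(2) * 100

correlations = lagged_correlation(interest, cases)

# The lag with the highest correlation suggests how far searches lead cases.
best_lag = correlations.idxmax()

print(correlations)
print(best_lag)
```

A high correlation at some lag is only suggestive; seasonal illnesses and news coverage can also move symptom searches.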

#### Continuation
From the trends of just a few precursory COVID symptoms, search interest appears to rise and fall with active COVID cases, although this is a visual impression rather than a measured correlation. Next, we will create a heat map of total COVID-related Google searches and compare it with the total COVID cases in each state.

In [10]:
#Import the Google data of total interest
df_google_data = pd.read_csv("https://raw.githubusercontent.com/google-research/open-covid-19-data/master/data/exports/search_trends_symptoms_dataset/United%20States%20of%20America/2020_US_weekly_symptoms_dataset.csv")
df_google_data.head()

Unnamed: 0,open_covid_region_code,country_region_code,country_region,sub_region_1,sub_region_1_code,sub_region_2,sub_region_2_code,date,symptom:Abdominal obesity,symptom:Abdominal pain,...,symptom:Wart,symptom:Water retention,symptom:Weakness,symptom:Weight gain,symptom:Wheeze,symptom:Xeroderma,symptom:Xerostomia,symptom:Yawn,symptom:hyperhidrosis,symptom:pancreatitis
0,US-AK,US,United States,Alaska,US-AK,,,2020-01-06,,,...,,,,,,,,14.28,,
1,US-AK,US,United States,Alaska,US-AK,,,2020-01-13,,,...,,,,,,,,16.26,,
2,US-AK,US,United States,Alaska,US-AK,,,2020-01-20,,,...,,,,,,,,17.48,,
3,US-AK,US,United States,Alaska,US-AK,,,2020-01-27,,,...,,,,,,,,10.93,,
4,US-AK,US,United States,Alaska,US-AK,,,2020-02-03,,,...,,,,,,,,18.93,,
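
The heat map described above needs one value per state, so the weekly symptom rows have to be collapsed. The following is a minimal sketch on synthetic rows shaped like the export above; the choice of `symptom:Cough` and `symptom:Fever` as the COVID-related columns, the treatment of rows with a `sub_region_2` value as sub-state rows, and the name `mean_interest` are all assumptions made for illustration.

```python
import pandas as pd

# Synthetic rows mimicking the layout of the weekly symptoms export.
df_symptoms = pd.DataFrame({
    'sub_region_1_code': ['US-AK', 'US-AK', 'US-AL', 'US-AL'],
    'sub_region_2': [None, None, None, None],
    'date': ['2020-01-06', '2020-01-13', '2020-01-06', '2020-01-13'],
    'symptom:Cough': [10.0, 20.0, 30.0, 50.0],
    'symptom:Fever': [None, 4.0, 6.0, 8.0],
})

# Hypothetical choice of symptoms to treat as COVID-related.
covid_symptoms = ['symptom:Cough', 'symptom:Fever']

# Keep state-level rows only (assumed: sub-state rows carry a sub_region_2 value).
state_rows = df_symptoms[df_symptoms['sub_region_2'].isna()].copy()

# 'US-AK' -> 'AK', the two-letter form used by Plotly's 'USA-states' location mode.
state_rows['state'] = state_rows['sub_region_1_code'].str.replace('US-', '', regex=False)

# Average each symptom over time per state (missing weeks are skipped),
# then average across the selected symptoms.
state_interest = (state_rows.groupby('state')[covid_symptoms]
                            .mean()
                            .mean(axis=1)
                            .rename('mean_interest')
                            .reset_index())

print(state_interest)
```

The resulting `state` and `mean_interest` columns have the same shape as the `state` and `cases` columns used for Figure 1, so they could feed a choropleth in the same way.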
