<h1><center>The Relationship Between COVID-19 Spread, Temperature, and Relative Humidity</center></h1>
<h3><center>By Momo Rutkin, Mohammed Syed, and Ambika Natarajan </center></h3>
<h3><center>ENVS/CHEM 328 - Emory University Spring 2020 </center></h3>



## Introduction:

<p style= "text-indent: 25px;"> Around the world, daily life has been disrupted by the spread of SARS-CoV-2, a virus new to the human population. The exact time and place of transmission from an animal reservoir to a human host is disputed, but the first observed outbreak was in Wuhan, China, in December 2019. The arrival of the Lunar New Year meant that many people travelled out of Wuhan to visit their hometowns, spreading the virus throughout China. International travel then carried the virus abroad, and concern rose when people who had not travelled began testing positive, which indicated community spread of COVID-19 (Wu, 2020). </p>

<p style= "text-indent: 25px;"> Travel is one of the leading contributors to disease spread, but once the virus is present in several locations, additional factors such as population density, access to adequate healthcare, and the effectiveness of policy implementation can all affect its spread. Another factor, with several subcomponents, is regional climate. Certain temperatures and humidity levels can affect how readily humans contract a disease (LaFave, 2020). Additionally, how long a virus survives may depend on temperature, a relationship known as seasonality (Langlois, 2020). The analysis reported in this paper does not separate these two effects; instead, it looks for any correlation between COVID-19 spread and temperature and humidity, and uses that information as a starting point for further inquiry. </p>

## Data Collection

<p style= "text-indent: 25px;"> The data used in this study was primarily collected from the <a href="https://www.kaggle.com/c/covid19-global-forecasting-week-1/discussion">Johns Hopkins University Center for Systems Science and Engineering’s COVID-19 Forecasting Competition</a>, <a href="https://www.kaggle.com/noaa/gsod">the NOAA GSOD dataset</a>, and <a href="https://www.kaggle.com/tanuprabhu/population-by-country-2020">WorldOMeter's Population By Country 2020</a>. The data sets were chosen because they are relatively clean and up to date. After merging, the data of interest consists of <u>322 satellite reference points from 116 countries</u> from <u>2020-01-22 to 2020-04-11</u>. The data sets and the fields taken from each are listed below. </p>

### Joined Dataset From 01/22/2020 - Present (05/14/2020)

* [JHU COVID-19 Forecasting Competition](https://www.kaggle.com/c/covid19-global-forecasting-week-1/discussion) 
    * Province_State
    * Country_Region
    * Date
    * ConfirmedCases
    * Fatalities
    * Lat and Long 
    

* [the NOAA GSOD dataset](https://www.kaggle.com/noaa/gsod)
    * Relative Humidity (rh) per day 
    * Temperature (min, max, average) in Celsius per day
    * Wind Speed (wdsp) per day
    * Precipitation per day 
    * Temperature Variance (daily max minus daily min) per day 
    
    
* [WorldOMeter's Population By Country 2020](https://www.kaggle.com/tanuprabhu/population-by-country-2020)
    * Urban Population percentage per country 
    * Density per country (P/km^2)
    * Population per country 
    * Median age per country 
   
### Dataset From 1/22/2020 - Present (05/14/2020)   
* [Novel Corona Virus 2019 Dataset](https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset)
    * Province_State
    * Country_Region
    * Date
    * ConfirmedCases
    * Fatalities
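
The join described above can be sketched on toy data. The notebook itself loads a weather file that has already been merged with the case counts, so this is only an illustration of the idea, not the authors' pipeline; the frames `cases`, `weather`, and `population` and every value in them are made up for the example.

```python
import pandas as pd

# Toy stand-ins for the three sources; all values are illustrative.
cases = pd.DataFrame({
    "Country_Region": ["Iceland", "Iceland", "Niger"],
    "Date": pd.to_datetime(["2020-03-01", "2020-03-02", "2020-03-01"]),
    "ConfirmedCases": [3, 6, 0],
})
weather = pd.DataFrame({
    "Country_Region": ["Iceland", "Iceland", "Niger"],
    "Date": pd.to_datetime(["2020-03-01", "2020-03-02", "2020-03-01"]),
    "temp": [30.1, 28.4, 93.2],
    "rh": [0.81, 0.79, 0.45],
})
population = pd.DataFrame({
    "Country_Region": ["Iceland"],
    "Population": [340795],
})

# Daily weather joins on location and date; population joins on country only.
merged = cases.merge(weather, on=["Country_Region", "Date"], how="left")
merged = merged.merge(population, on="Country_Region", how="left")

# Countries absent from the population table come through as NaN.
print(merged)
```

A left join keeps every case row even when a country name has no match in the population table, which is why unmatched names need to be reconciled afterwards.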


In [55]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/weather-cases/training_data_with_weather_info_week_1.csv
/kaggle/input/weather-cases/training_data_with_weather_info_week_4.csv
/kaggle/input/uncover/hackathon_file_readme.txt
/kaggle/input/uncover/OpenTable/restaurant-performance.csv
/kaggle/input/uncover/ontario_government/status-of-covid-19-cases-in-ontario (1).csv
/kaggle/input/uncover/ontario_government/confirmed-positive-cases-of-covid-19-in-ontario.csv
/kaggle/input/uncover/public_health_england/covid-19-daily-confirmed-cases.csv
/kaggle/input/uncover/public_health_england/covid-19-cases-by-county-uas.csv
/kaggle/input/uncover/ihme/projected-hospital-resource-use-based-on-covid-19-deaths.csv
/kaggle/input/uncover/USAFacts/confirmed-covid-19-deaths-in-us-by-state-and-county.csv
/kaggle/input/uncover/USAFacts/confirmed-covid-19-cases-in-us-by-state-and-county.csv
/kaggle/input/uncover/world_bank/total-covid-19-tests-performed-by-country.csv
/kaggle/input/uncover/us_cdc/us_cdc/u-s-chronic-disease-indicators-cdi.csv
/k

In [57]:
# Additional Imports 

# essential libraries
import math
import random
from datetime import timedelta
from IPython.core.display import HTML
#import googlemaps
from datetime import datetime



# storing and analysis
import numpy as np
import pandas as pd


%matplotlib inline



# visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objs as go
import plotly.figure_factory as ff
from plotly.subplots import make_subplots
import folium

# converter
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters() 

#offline plotting 
from plotly.offline import plot, iplot, init_notebook_mode
init_notebook_mode(connected=True)

In [58]:
from scipy import signal

def get2deriv(df, index, column, value):
    # Pivot to a date-by-location table of summed values, forward-filling gaps
    newdf = df.pivot_table(
        index=index, columns=column, values=value, aggfunc=np.sum
    ).fillna(method="ffill").fillna(0)
    # Growth factor: (N_t - N_{t-1}) / (N_{t-1} - N_{t-2})
    #newdf['shift1'] = newdf.shift(1, axis=0)
    #newdf['shift2'] = newdf.shift(2, axis=0)
    newdf =  (newdf-newdf.shift(1, axis=0))/(newdf.shift(1, axis=0)-newdf.shift(2, axis=0))
    #newdf = newdf.drop(['shift1', 'shift2', 'value'], axis=1)
    
    # Treat division by zero and undefined ratios as "no change" (1.0)
    newdf = newdf.replace(np.inf, np.nan).fillna(1.0)
    # Rolling mean (window: 3 days)
    newdf = newdf.rolling(3).mean().dropna().loc[:df[index].max(), :]
    # round: 0.01
    growth_value_df = newdf.round(2)
    #growth_value_df = smoothergf(growth_value_df, 0.5, 5)
    growth_value_df.tail()
    
    frame = growth_value_df.copy()
    N, K = frame.shape
    data = {'2ndderiv': frame.to_numpy().ravel('F'),
            column: np.asarray(frame.columns).repeat(N),
            index: np.tile(np.asarray(frame.index), K)}
    return pd.DataFrame(data, columns=[index, column, '2ndderiv'])

def get2derivworld(df, index, value):
    # Pivot to a per-date table of summed values (all locations combined)
    newdf = df.pivot_table(
        index=index, values=value, aggfunc=np.sum
    ).fillna(method="ffill").fillna(0)
    # Growth factor: (N_t - N_{t-1}) / (N_{t-1} - N_{t-2})
    #newdf['shift1'] = newdf.shift(1, axis=0)
    #newdf['shift2'] = newdf.shift(2, axis=0)
    newdf =  (newdf-newdf.shift(1, axis=0))/(newdf.shift(1, axis=0)-newdf.shift(2, axis=0))
    #newdf = newdf.drop(['shift1', 'shift2', 'value'], axis=1)
    
    # Treat division by zero and undefined ratios as "no change" (1.0)
    newdf = newdf.replace(np.inf, np.nan).fillna(1.0)
    # Rolling mean (window: 3 days)
    newdf = newdf.rolling(3).mean().dropna().loc[:df[index].max(), :]
    # round: 0.01
    growth_value_df = newdf.round(2)
    #growth_value_df = smoothergf(growth_value_df, 0.5, 5)
    growth_value_df.tail()
    
    frame = growth_value_df.copy()
    N, K = frame.shape
    data = {'2ndderiv': frame.to_numpy().ravel('F'),
            index: np.tile(np.asarray(frame.index), K)}
    return pd.DataFrame(data, columns=[index, '2ndderiv'])

def get1deriv(df, index, column, value):
    # Pivot to a date-by-location table of summed values, forward-filling gaps
    newdf = df.pivot_table(
        index=index, columns=column, values=value, aggfunc="sum"
    ).fillna(method="ffill").fillna(0)
    # First difference: N_t - N_{t-1}
    newdf = newdf.diff()
    newdf = newdf.replace(np.inf, np.nan).fillna(1.0)
    # Rolling mean (window: 3 days)
    newdf = newdf.rolling(3).mean().dropna().loc[:df[index].max(), :]
    # round: 0.01
    growth_value_df = newdf.round(2)
    #growth_value_df = smoothergf(growth_value_df, 0.5, 3)
    growth_value_df.tail()
    
    frame = growth_value_df.copy()
    N, K = frame.shape
    data = {'1stderiv': frame.to_numpy().ravel('F'),
            column: np.asarray(frame.columns).repeat(N),
            index: np.tile(np.asarray(frame.index), K)}
    return pd.DataFrame(data, columns=[index, column,'1stderiv'])

def get1derivworld(df, index, value):
    # Pivot to a per-date table of summed values (all locations combined)
    newdf = df.pivot_table(
        index=index,values=value, aggfunc="sum"
    ).fillna(method="ffill").fillna(0)
    # First difference: N_t - N_{t-1}
    newdf = newdf.diff()
    newdf = newdf.replace(np.inf, np.nan).fillna(1.0)
    # Rolling mean (window: 3 days)
    newdf = newdf.rolling(3).mean().dropna().loc[:df[index].max(), :]
    # round: 0.01
    growth_value_df = newdf.round(2)
    #growth_value_df = smoothergf(growth_value_df, 0.5, 3)
    growth_value_df.tail()
    
    frame = growth_value_df.copy()
    N, K = frame.shape
    data = {'1stderiv': frame.to_numpy().ravel('F'),
            index: np.tile(np.asarray(frame.index), K)}
    return pd.DataFrame(data, columns=[index, '1stderiv'])

def getdaily(df, index, column, value):
    # Pivot to a date-by-location table of cumulative counts, forward-filling gaps
    newdf = df.pivot_table(
        index=index, columns=column, values=value, aggfunc="sum"
    ).fillna(method="ffill").fillna(0)
    # Daily new cases: cumulative count today minus cumulative count yesterday
    newdf = newdf.diff()
    newdf = newdf.replace(np.inf, np.nan).fillna(1.0)
    # Rolling mean (window: 3 days)
    newdf = newdf.rolling(3).mean().dropna().loc[:df[index].max(), :]
    # round: 0.01
    growth_value_df = newdf.round(2)
    growth_value_df.tail()
    
    frame = growth_value_df.copy()
    N, K = frame.shape
    data = {'daily_case': frame.to_numpy().ravel('F'),
            column: np.asarray(frame.columns).repeat(N),
            index: np.tile(np.asarray(frame.index), K)}
    return pd.DataFrame(data, columns=[index, column,'daily_case'])

def getlag(df, index, column, value):
    # Pivot to a date-by-location table of summed values, forward-filling gaps
    newdf = df.pivot_table(
        index=index, columns=column, values=value, aggfunc=np.sum
    ).fillna(method="ffill").fillna(0)
    # Shift values 7 rows earlier, so each date carries the value reported 7 days later
    newdf = newdf.shift(-7, axis=0)
    
    # The last 7 dates have no future value; fill them with 0
    newdf = newdf.replace(np.inf, np.nan).fillna(0.0)
    # No rolling mean is applied to the shifted series
    #newdf = newdf.rolling(3).mean().dropna().loc[:df[index].max(), :]
    # No rounding is applied either
    growth_value_df = newdf
    
    growth_value_df.tail()
    
    frame = growth_value_df.copy()
    N, K = frame.shape
    data = {'lag': frame.to_numpy().ravel('F'),
            column: np.asarray(frame.columns).repeat(N),
            index: np.tile(np.asarray(frame.index), K)}
    return pd.DataFrame(data, columns=[index, column,'lag'])

def getlagdeath(df, index, column, value):
    # Pivot to a date-by-location table of summed values, forward-filling gaps
    newdf = df.pivot_table(
        index=index, columns=column, values=value, aggfunc=np.sum
    ).fillna(method="ffill").fillna(0)
    # Shift values 7 rows earlier, so each date carries the value reported 7 days later
    newdf = newdf.shift(-7, axis=0)
    
    # The last 7 dates have no future value; fill them with 0
    newdf = newdf.replace(np.inf, np.nan).fillna(0.0)
    # No rolling mean is applied to the shifted series
    #newdf = newdf.rolling(3).mean().dropna().loc[:df[index].max(), :]
    # No rounding is applied either
    growth_value_df = newdf
    
    growth_value_df.tail()
    
    frame = growth_value_df.copy()
    N, K = frame.shape
    data = {'lag_death': frame.to_numpy().ravel('F'),
            column: np.asarray(frame.columns).repeat(N),
            index: np.tile(np.asarray(frame.index), K)}
    return pd.DataFrame(data, columns=[index, column,'lag_death'])

#global measure
def add_daily_measures(df):
    df.loc[0,'Daily Cases'] = df.loc[0,'ConfirmedCases']
    df.loc[0,'Daily Deaths'] = df.loc[0,'Fatalities']
    for i in range(1,len(df)):
        df.loc[i,'Daily Cases'] = df.loc[i,'ConfirmedCases'] - df.loc[i-1,'ConfirmedCases']
        df.loc[i,'Daily Deaths'] = df.loc[i,'Fatalities'] - df.loc[i-1,'Fatalities']
    #Make the first row as 0 because we don't know the previous value
    df.loc[0,'Daily Cases'] = 0
    df.loc[0,'Daily Deaths'] = 0
    return df

#smoothens out data for plotting (one line)
def smoother(df, index, col):
    noise = 2 * np.random.random(len(df[index])) - 1 # uniformly distributed between -1 and 1
    y_noise = df[col] + noise
    y_col = signal.savgol_filter(y_noise, 53, 3)
    return y_col

#smoothens growth factor 
def smoothergf(inputdata,w,imax):
    data = 1.0*inputdata
    data = data.replace(np.nan,1)
    data = data.replace(np.inf,1)
    #print(data)
    smoothed = 1.0*data
    normalization = 1
    for i in range(-imax,imax+1):
        if i==0:
            continue
        smoothed += (w**abs(i))*data.shift(i,axis=0)
        normalization += w**abs(i)
    smoothed /= normalization
    return smoothed


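################## only for treemap ###############
# Requires `import pycountry` and `import pycountry_convert as pc`,
# neither of which is imported in the cells above.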

class country_utils():
    def __init__(self):
        self.d = {}
    
    def get_dic(self):
        return self.d
    
    def get_country_details(self,country):
        """Returns country code(alpha_3) and continent"""
        try:
            country_obj = pycountry.countries.get(name=country)
            if country_obj is None:
                c = pycountry.countries.search_fuzzy(country)
                country_obj = c[0]
            continent_code = pc.country_alpha2_to_continent_code(country_obj.alpha_2)
            continent = pc.convert_continent_code_to_continent_name(continent_code)
            return country_obj.alpha_3, continent
        except Exception:
            if 'Congo' in country:
                country = 'Congo'
            elif country == 'Diamond Princess' or country == 'Laos' or country == 'MS Zaandam'\
            or country == 'Holy See' or country == 'Timor-Leste':
                return country, country
            elif country == 'Korea, South' or country == 'South Korea':
                country = 'Korea, Republic of'
            elif country == 'Taiwan*':
                country = 'Taiwan'
            elif country == 'Burma':
                country = 'Myanmar'
            elif country == 'West Bank and Gaza':
                country = 'Gaza'
            else:
                return country, country
            country_obj = pycountry.countries.search_fuzzy(country)
            continent_code = pc.country_alpha2_to_continent_code(country_obj[0].alpha_2)
            continent = pc.convert_continent_code_to_continent_name(continent_code)
            return country_obj[0].alpha_3, continent
    
    def get_iso3(self, country):
        return self.d[country]['code']
    
    def get_continent(self,country):
        return self.d[country]['continent']
    
    def add_values(self,country):
        self.d[country] = {}
        self.d[country]['code'],self.d[country]['continent'] = self.get_country_details(country)
    
    def fetch_iso3(self,country):
        if country in self.d.keys():
            return self.get_iso3(country)
        else:
            self.add_values(country)
            return self.get_iso3(country)
        
    def fetch_continent(self,country):
        if country in self.d.keys():
            return self.get_continent(country)
        else:
            self.add_values(country)
            return self.get_continent(country)

### Merged JHU COVID-19 Forecasting Competition and NOAA GSOD Data

In [59]:
# Load the merged case/weather data and add 'temp variance' (daily max minus daily min)
weather_country = pd.read_csv('/kaggle/input/weather-data-5/training_data_with_weather_info_week_5.csv', parse_dates=['Date'])
weather_country = weather_country.rename(columns={"country+province": "country_province"})
weather_country['Date'] = pd.to_datetime(weather_country['Date'], format = '%Y-%m-%d')
weather_country['temp variance'] = weather_country['max'] - weather_country['min']
weather_country = weather_country.drop(columns = ['stp', 'slp', 'dewp'])
#weather_country['LatLong'] = "("+ str(weather_country['Lat'].round(5)) + "," + str(weather_country['Long'].round(5)) + ")"
#weather_country['LatLong'].sample(10)
weather_country["latlong"] = list(zip(weather_country.Lat.round(6), weather_country.Long.round(6)))
weather_country.sample(1)

#c = weather_country.loc[weather_country['Country_Region'] == 'US']


Unnamed: 0,Id,Province_State,Country_Region,Date,ConfirmedCases,Fatalities,country_province,Lat,Long,day_from_jan_first,temp,min,max,rh,ah,wdsp,prcp,fog,temp variance,latlong
21603,21604,,Niger,2020-03-19,0.0,0.0,Niger-,13.511667,2.125278,79,93.2,85.1,101.5,0.449923,0.155234,6.5,0.0,0,16.4,"(13.511667, 2.125278)"


### Joined WorldOMeter's Population By Country 2020 With Previous Dataset 

In [60]:
#added population and density per country to weather_country
pp = pd.read_csv("/kaggle/input/covid19-global-forecasting-locations-population/locations_population.csv")
pop = pd.read_csv("../input/population-by-country-2020/population_by_country_2020.csv")
# keep only the first 10 columns
pop = pop.iloc[:, :10]
# rename column names
pop.columns = ['Country_Region', 'Population', 'Year Change', 'Net Change', 'Density P/km^2', 'Land Area','Migrants','Fert. Rate','Med. Age', 'Urban Pop %']
pop = pop.drop(columns = ['Year Change', 'Net Change','Land Area','Migrants','Fert. Rate'])
pop['Urban Pop %'] = pop['Urban Pop %'].replace({'N.A.':'0 %'})
pop['Urban Pop %'] = pop['Urban Pop %'].str.rstrip('%').astype('float')

# update population
cols = ['Burma', 'Congo (Brazzaville)', 'Congo (Kinshasa)', "Cote d'Ivoire", 'Czechia', 
        'Kosovo', 'Saint Kitts and Nevis', 'Saint Vincent and the Grenadines', 
        'Taiwan*', 'US', 'West Bank and Gaza']
pops = [54409800, 89561403, 5518087, 26378274, 10708981, 1793000, 
        53109, 110854, 23806638, 330541757, 4569000]
dense = [83, 16, 16, 83, 138, 159, 205, 284, 673, 35, 758]
medage = [29.2, 16.7, 19.5, 20.3, 43.3, 30.5, 36.5, 35.3, 42.3, 38.5, 21.9]
urbanpop = [31, 44, 67, 51, 74, 50, 31, 52, 78, 82, 76]
new_df = pd.DataFrame({'Country_Region': cols, 'Population': pops, 'Density P/km^2': dense, 'Med. Age': medage, 'Urban Pop %': urbanpop})
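# Caution: DataFrame.update aligns on the row index, so these 11 rows overwrite
# rows 0-10 of `pop` instead of being appended; the next cell fills in several
# large countries by hand.
pop.update(new_df)
# replace() returns a new frame, so assign the result back
pop = pop.replace(['South Korea', 'North Korea'], ['Korea, South', 'Korea, North'])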
# merged data
weather_country = pd.merge(weather_country, pop, on='Country_Region', how='left')
weather_country['Population'] = weather_country['Population'].fillna(0)
weather_country.sample(3)



Unnamed: 0,Id,Province_State,Country_Region,Date,ConfirmedCases,Fatalities,country_province,Lat,Long,day_from_jan_first,...,ah,wdsp,prcp,fog,temp variance,latlong,Population,Density P/km^2,Med. Age,Urban Pop %
9385,9386,Tianjin,China,2020-02-28,136.0,3.0,China-Tianjin,39.3054,117.323,59,...,0.379975,2.8,0.0,0,21.6,"(39.3054, 117.323)",0.0,,,
15872,15873,,Iceland,2020-02-17,0.0,0.0,Iceland-,64.9631,-19.0208,48,...,0.246917,18.2,99.99,1,7.2,"(64.9631, -19.0208)",340795.0,3.0,37.0,94.0
13636,13637,Reunion,France,2020-04-01,281.0,0.0,France-Reunion,-21.1351,55.2471,92,...,0.241183,11.4,0.0,0,4.9,"(-21.1351, 55.2471)",65244628.0,119.0,42.0,82.0
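
The sample above shows China with a population of `0.0` after the merge. One plausible cause is that `DataFrame.update` matches rows by index label rather than appending, so a frame indexed 0..10 overwrites the first eleven rows of the target. The sketch below reproduces that behavior on a hypothetical miniature table (`pop_demo` and `extra` are illustrative names, and the comparison with `pd.concat` is a suggestion, not something the notebook does).

```python
import pandas as pd

# Hypothetical miniature of the population table (names are illustrative).
pop_demo = pd.DataFrame({
    "Country_Region": ["China", "India", "Iceland"],
    "Population": [1438116346, 1380004385, 340795],
})
extra = pd.DataFrame({
    "Country_Region": ["Burma"],
    "Population": [54409800],
})

# update() matches rows by index label, so row 0 ("China") is overwritten.
updated_in_place = pop_demo.copy()
updated_in_place.update(extra)

# concat() appends the new row and leaves the existing rows alone.
appended = pd.concat([pop_demo, extra], ignore_index=True)

print(updated_in_place["Country_Region"].tolist())
print(appended["Country_Region"].tolist())
```

If appending is the intent, `pd.concat` avoids having to restore the overwritten countries by hand later.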


In [61]:
#display(weather_country.loc[weather_country['Population'] != 0.0])

#filled the population and density for places that weren't recorded 
# Manual values for countries whose population data is missing after the merge.
overrides = {
    'Bangladesh': {"Population": 164336258, "Density P/km^2": 1265, "Med. Age": 27.9, "Urban Pop %": 37},
    'Brazil': {"Population": 212228418, "Density P/km^2": 25, "Med. Age": 33.2, "Urban Pop %": 87},
    'Kuwait': {"Urban Pop %": 100},
    'Holy See': {"Urban Pop %": 100, "Med. Age": 60},
    'Andorra': {"Med. Age": 44.9},
    'San Marino': {"Med. Age": 44.5},
    'Dominica': {"Med. Age": 34},
    'Liechtenstein': {"Med. Age": 43.4},
    'India': {"Population": 1380004385, "Density P/km^2": 464, "Urban Pop %": 34, "Med. Age": 28.7},
    'China': {"Population": 1438116346, "Density P/km^2": 153, "Med. Age": 38.4, "Urban Pop %": 59},
    'West Bank and Gaza': {"Population": 4569000, "Density P/km^2": 758, "Med. Age": 21.9, "Urban Pop %": 76},
    'Indonesia': {"Population": 272884327, "Density P/km^2": 151, "Med. Age": 31.1, "Urban Pop %": 55},
    'US': {"Urban Pop %": 82},
    'Venezuela': {"Urban Pop %": 88},
    'Monaco': {"Urban Pop %": 100, "Med. Age": 53.1},
    'Singapore': {"Urban Pop %": 100},
    'Japan': {"Population": 126559084, "Density P/km^2": 347, "Med. Age": 48.6, "Urban Pop %": 92},
    'Sao Tome and Principe': {"Population": 218241, "Density P/km^2": 228, "Med. Age": 19.3, "Urban Pop %": 73},
    'Korea, South': {"Population": 51259674, "Density P/km^2": 527, "Med. Age": 43.2, "Urban Pop %": 81},
    'Russia': {"Population": 145920988, "Density P/km^2": 9, "Med. Age": 40.3, "Urban Pop %": 74},
    'Pakistan': {"Population": 219922471, "Density P/km^2": 287, "Med. Age": 22, "Urban Pop %": 37},
    'Mexico': {"Population": 128633396, "Density P/km^2": 66, "Med. Age": 29.3, "Urban Pop %": 80},
    'Nigeria': {"Population": 204968096, "Density P/km^2": 226, "Med. Age": 18.6, "Urban Pop %": 50},
}

# Cruise ships are not countries; drop their rows.
ships = ['Diamond Princess', 'MS Zaandam']
weather_country = weather_country[~weather_country["Country_Region"].isin(ships)]

# Apply each override to every row of the matching country.
for country, values in overrides.items():
    mask = weather_country["Country_Region"] == country
    for col, val in values.items():
        weather_country.loc[mask, col] = val

# Cases per population 
weather_country['Urban Pop %'].fillna(0, inplace=True)
weather_country['Urban Pop %'] = weather_country['Urban Pop %']/ 100.0
weather_country['Cases_Million_People'] = round((weather_country['ConfirmedCases'] / weather_country['Population']) * 1000000)
weather_country['ln(Cases / Million People)'] = np.log(weather_country.Cases_Million_People + 1)

weather_country.sample(3)


eastasia  = ['Taiwan*', 'Taiwan', 'Mongolia', 'China', 'Japan', 'South Korea', 'Korea, South']
europe = ['Latvia', 'Switzerland', 'Liechtenstein', 'Italy', 'Norway', 'Austria', 'Albania',
          'United Kingdom', 'Iceland', 'Finland', 'Luxembourg', 'Belarus', 'Bulgaria', 
          'Guernsey', 'Poland', 'Moldova', 'Spain', 'Bosnia and Herzegovina', 'Portugal', 
          'Germany', 'Monaco', 'San Marino', 'Andorra', 'Slovenia', 'Montenegro', 'Ukraine',
          'Lithuania', 'Netherlands', 'Slovakia', 'Czechia', 'Malta', 'Hungary', 'Jersey', 
          'Serbia', 'Kosovo', 'France', 'Croatia', 'Sweden', 'Estonia', 'Denmark', 
          'North Macedonia', 'Greece', 'Ireland', 'Romania', 'Belgium']
a = []
for i in weather_country.index:
    if weather_country.loc[i, "Country_Region"] in eastasia:
        a.append("East Asia")
    elif weather_country.loc[i, "Country_Region"] in europe:
        a.append("Europe")
    elif weather_country.loc[i, "Country_Region"] == 'US':
        a.append("US")
    else:
        a.append("Rest Of World")

weather_country["Country_Group"] = a
weather_country.sample(1)

Unnamed: 0,Id,Province_State,Country_Region,Date,ConfirmedCases,Fatalities,country_province,Lat,Long,day_from_jan_first,...,fog,temp variance,latlong,Population,Density P/km^2,Med. Age,Urban Pop %,Cases_Million_People,ln(Cases / Million People),Country_Group
23332,23333,,Russia,2020-04-07,7497.0,58.0,Russia-,60.0,90.0,98,...,0,9.4,"(60.0, 90.0)",145920988.0,9.0,40.3,0.74,51.0,3.951244,Rest Of World


### Extracted and Added: 
* daily cases
* cases per million people
* daily case rate (1st deriv)
* daily case growth factor (2nd deriv)
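
As a rough illustration of the quantities listed above, the sketch below computes them for a single made-up cumulative series. It omits the 3-day rolling mean and the rounding that the notebook's helper functions apply, and the `population` value is hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up cumulative case counts for one location.
cumulative = pd.Series(
    [10, 12, 16, 24, 40],
    index=pd.date_range("2020-03-01", periods=5),
)

daily = cumulative.diff()        # new cases per day
rate = daily.diff()              # day-over-day change in new cases
growth = daily / daily.shift(1)  # today's new cases / yesterday's new cases

# Division by zero yields inf; like the helpers, treat undefined values as 1.0.
growth = growth.replace(np.inf, np.nan).fillna(1.0)

population = 1_000_000  # hypothetical population
cases_per_million = cumulative / population * 1_000_000

print(pd.DataFrame({"daily": daily, "rate": rate, "growth": growth}))
```

A growth factor above 1 means new cases are accelerating, a value of 1 means they are steady, and a value below 1 means they are slowing.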

## Result: Final Dataset (weather_country) from 01-22-2020 to 05-14-2020

In [62]:
second = get2deriv(weather_country, 'Date', 'latlong', 'ConfirmedCases')
weather_country = pd.merge(weather_country, second, on=['latlong','Date'], how='left')

daily = getdaily(weather_country, 'Date', 'latlong', 'ConfirmedCases')
weather_country = pd.merge(weather_country, daily, on=['latlong','Date'], how='left')

weather_country["daily_case"] = weather_country["daily_case"].fillna(0.0)

first = get1deriv(weather_country, 'Date', 'latlong', 'daily_case')
weather_country = pd.merge(weather_country, first, on=['latlong','Date'], how='left')

lag = getlag(weather_country, 'Date', 'latlong', 'ConfirmedCases')
weather_country = pd.merge(weather_country, lag, on=['latlong','Date'], how='left')

lag2 = getlagdeath(weather_country, 'Date', 'latlong', 'Fatalities')
weather_country = pd.merge(weather_country, lag2, on=['latlong','Date'], how='left')

weather_country["lag"] = weather_country["lag"].fillna(0.0)
weather_country["lag_death"] = weather_country["lag_death"].fillna(0.0)

first = get1deriv(weather_country, 'Date', 'latlong', 'lag')
first = first.rename(columns={"1stderiv": "lag_1stderiv"})
weather_country = pd.merge(weather_country, first, on=['latlong','Date'], how='left')


weather_country["2ndderiv"] = weather_country["2ndderiv"].fillna(1.0)
weather_country["1stderiv"] = weather_country["1stderiv"].fillna(0.0)
weather_country["lag"] = weather_country["lag"].fillna(0.0)
weather_country["lag_death"] = weather_country["lag_death"].fillna(0.0)
weather_country["lag_1stderiv"] = weather_country["lag_1stderiv"].fillna(0.0)
#display(weather_country.loc[weather_country['Country_Region'] == 'Germany'])
weather_country.sample(3)

Unnamed: 0,Id,Province_State,Country_Region,Date,ConfirmedCases,Fatalities,country_province,Lat,Long,day_from_jan_first,...,Urban Pop %,Cases_Million_People,ln(Cases / Million People),Country_Group,2ndderiv,daily_case,1stderiv,lag,lag_death,lag_1stderiv
278,279,,Algeria,2020-03-12,24.0,1.0,Algeria-,28.0339,1.6596,72,...,0.73,1.0,0.693147,Rest Of World,0.67,1.33,0.11,87.0,9.0,11.0
5128,5129,Northwest Territories,Canada,2020-05-13,5.0,0.0,Canada-Northwest Territories,62.442222,-114.394722,134,...,0.81,0.0,0.0,Rest Of World,1.0,0.0,0.0,0.0,0.0,0.0
27127,27356,Arizona,US,2020-05-10,11119.0,536.0,US-Arizona,33.7298,-111.4312,131,...,0.82,34.0,3.555348,US,1.18,391.33,17.0,0.0,0.0,-4224.67
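
The `lag` and `lag_death` columns come from `shift(-7)`, which pairs each date with the count reported seven days later. The source does not state the motivation; a delay between exposure and a confirmed test result is one plausible reason. The toy sketch below uses a 2-day shift (`lead_days` is an illustrative name) to show the mechanics, including the zero fill at the end of the series, which would account for `lag` values of 0.0 on the latest dates in the sample above.

```python
import pandas as pd

# Made-up cumulative counts for one location.
cases = pd.Series(
    [1, 2, 4, 8, 16],
    index=pd.date_range("2020-03-01", periods=5),
    name="ConfirmedCases",
)

lead_days = 2  # the notebook's helpers use 7

# shift(-k) moves values k rows earlier, so each date sees the count k days ahead.
lagged = cases.shift(-lead_days).fillna(0.0)

frame = pd.DataFrame({"ConfirmedCases": cases, "lag": lagged})
print(frame)
```

Because the final rows are filled with 0 rather than left missing, any difference taken across that boundary produces a large negative value, so those rows are best excluded from later analysis.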


## Result: Final Dataset (updated) from 01-22-2020 to 05-14-2020 
### DOES NOT contain climate determinant variables  

In [63]:
updated = pd.read_csv('/kaggle/input/novel-corona-virus-2019-dataset/covid_19_data.csv', parse_dates=['ObservationDate'])
latlongupdate = pd.read_csv('/kaggle/input/novel-corona-virus-2019-dataset/time_series_covid_19_confirmed.csv')
updated["countstate"] = tuple(zip(updated['Country/Region'], updated['Province/State']))
updated['Province/State'].replace(np.nan, "Not Reported", inplace=True)
#updated["ObservationDate"]=pd.to_datetime(updated["ObservationDate"])
updated = updated.rename(columns={"ObservationDate": "Date"})
updated['Date'] = pd.to_datetime(updated['Date'], format = '%Y-%m-%d')

# One row per (country, province) pair; the row's original index serves as a location ID
a = updated[['Country/Region', 'Province/State']].drop_duplicates().copy()
a['identify'] = a.index

#adding active cases 
updated = pd.merge(updated, a, on=['Country/Region', 'Province/State'], how='left')
updated['active cases'] = updated['Confirmed'] - updated['Recovered'] - updated['Deaths']

#adding derivatives 
second = get2deriv(updated, 'Date', 'identify', 'Confirmed')
updated = pd.merge(updated, second, on=['identify','Date'], how='left')

first = get1deriv(updated, 'Date', 'identify', 'Confirmed')
updated = pd.merge(updated, first, on=['identify','Date'], how='left')

updated["2ndderiv"] = updated["2ndderiv"].fillna(1.0)
updated["1stderiv"] = updated["1stderiv"].fillna(0.0)

#adding regions
eastasia  = ['Taiwan*', 'Taiwan', 'Mongolia', 'China', 'Mainland China','Japan', 'Korea, South','South Korea', 'Hong Kong']
europe = ['Latvia', 'Switzerland', 'Liechtenstein', 'Italy', 'Norway', 'Austria', 'Albania',
          'United Kingdom', 'Iceland', 'Finland', 'Luxembourg', 'Belarus', 'Bulgaria', 
          'Guernsey', 'Poland', 'Moldova', 'Spain', 'Bosnia and Herzegovina', 'Portugal', 
          'Germany', 'Monaco', 'San Marino', 'Andorra', 'Slovenia', 'Montenegro', 'Ukraine',
          'Lithuania', 'Netherlands', 'Slovakia', 'Czechia', 'Malta', 'Hungary', 'Jersey', 
          'Serbia', 'Kosovo', 'France', 'Croatia', 'Sweden', 'Estonia', 'Denmark', 
          'North Macedonia', 'Greece', 'Ireland', 'Romania', 'Belgium', 'UK']
# Assign each row to a region; anything not matched falls into "Rest Of The World"
updated['country_group'] = np.select(
    [updated['Country/Region'].isin(eastasia),
     updated['Country/Region'].isin(europe),
     updated['Country/Region'] == 'US'],
    ["East Asia", "Europe", "US"],
    default="Rest Of The World")

updated[updated['country_group']=="Rest Of The World"].head(3)


updated.sample(3)

Unnamed: 0,SNo,Date,Province/State,Country/Region,Last Update,Confirmed,Deaths,Recovered,countstate,identify,active cases,2ndderiv,1stderiv,country_group
16594,16595,2020-04-19,Anhui,Mainland China,2020-04-19 23:49:05,991.0,6.0,984.0,"(Mainland China, Anhui)",0,1.0,1.0,0.0,East Asia
7710,7711,2020-03-22,Not Reported,Mauritania,3/8/20 5:31,2.0,0.0,0.0,"(Mauritania, nan)",5617,2.0,0.67,0.0,Rest Of The World
4794,4795,2020-03-11,Ontario,Canada,2020-03-11T18:52:04,41.0,0.0,4.0,"(Canada, Ontario)",215,37.0,1.0,12.67,Rest Of The World


In [None]:
"""
#latlongupdate['countstate'] = tuple(zip(latlongupdate['Country/Region'], latlongupdate['Province/State']))
just_latlong = latlongupdate[['Country/Region','Province/State', 'Lat', 'Long']].copy()

#display(updated[updated['Country/Region']=="South Korea"])
just_latlong = just_latlong.replace('Korea, South', 'South Korea')


#display(just_latlong[just_latlong['Country/Region'].str.contains("Korea")])
just_latlong['Province/State'].replace(np.nan, "Not Reported", inplace=True)

wc = weather_country[['Country_Region', 'Province_State', 'Lat', 'Long']].copy()
wc = wc.rename(columns={'Country_Region': "Country/Region",'Province_State': 'Province/State' })
wc['Province/State'].replace(np.nan, "Not Reported", inplace=True)

nota = wc[~wc['Province/State'].isin(just_latlong['Province/State'])]
nota = nota.groupby(['Country/Region','Province/State'],as_index=False)['Lat','Long'].mean()
c = ['US', 'Macau','UK', 'UK', 'US', 'UK', 'UK', 'UK', 'UK', 'Germany', 'US', 'US', 'UK', 'UK', 'UK', 'UK', 'UK',
     'Czech Republic', 'The Bahamas', 'Republic of the Congo', 'Ivory Coast', 'Netherlands', 'Denmark', 'Palestine']
r =['Chicago', 'Macau', 'Isle of Man', 'Montserrat', 'Northern Mariana Islands', 'Turks and Caicos Islands', 
    'Falkland Islands (Malvinas)', 'Gibraltar','Cayman Islands', 'Bavaria', 'American Samoa', 'United States Virgin Islands', 
    'Channel Islands', 'Bermuda', 'Anguilla', 'British Virgin Islands', 'Not Reported', 'Not Reported', 'Not Reported', 'Not Reported', 'Not Reported',
   'Netherlands', 'Denmark', 'Not Reported']
la=[41.8339042,23.6356074, 54.2278829, 16.691357, 17.3076967, 21.5741504, -51.7206292, 36.1295735, 19.5081819, 48.8992765, -14.061727, 18.0672779,
   49.4582161, 32.3194245, 18.390315,18.5222738, 52.7602022, 49.7856662, 24.4229244, -0.6811523, 7.4662967, 52.1951016, 56.2128538, 31.8858324]
lo=[-88.0121503,114.4376334,-4.8523185,-60.2272795,143.2420346,-72.3505781,-60.6489884,-5.3883195,-81.1347306,9.1651538,-170.6672906,-65.2991668,
    -2.942905,-64.8364403,-63.4803453,-64.7114365,-6.813662,13.2321306,-78.2108881,10.3858451,-7.7921532,3.0367463,9.3001434,34.331614]
pp = pd.DataFrame(data={'Country/Region': c, 'Province/State':r, 'Lat':la, 'Long':lo})
just_latlong = just_latlong.append(nota)
just_latlong = just_latlong.append(pp)

just_latlong.sample(10)
print(len(c))
print(len(r))
print(len(la))
print(len(lo))

#with_updated.sample(2)
with_updated.isnull().any()
with_updated['bool_loc'] = pd.notnull(with_updated["Lat"]) 
a = with_updated[with_updated['Lat']=='Not Reported']
#a[a['Country/Region']!='UK'].sample(50)
#a = with_updated[with_updated['Country/Region']=='UK']
#a[a['Province/State']=='Not Reported'].sample(50)
#b[b['Province/State']== 'United Kingdom'].sample(7)
#display(with_updated[with_updated['Country/Region']=='Czechia'])
print(len(a))
print(len(with_updated))

"""

<p style= "text-indent: 25px;"> <u> Global inconsistencies in reporting cannot be corrected within the data set </u>, but attention to news articles discussing inconsistencies in reporting in specific locations can be helpful for a more critical analysis. The graph below, for example, shows how a <u> significant increase in the reporting of cases in Chicago, Illinois, from March 07 - March 13 </u> caused the acceleration of case spread to appear significantly higher than that of cities such as New York, which is now the epicenter of the pandemic (Correal, 2020). </p>

Zoom out in the graph to see how the U.S. growth factors compare to other locations around the world. 
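
As a rough illustration of how an acceleration measure like the one mapped here can be derived from cumulative counts, the sketch below computes a per-site growth factor: each day's new cases divided by the previous day's new cases. This definition is an assumption; the notebook's own `get1deriv` and `get2deriv` helpers are defined elsewhere and may differ. The `growth_factor` function, the `site` column, and the demo numbers are hypothetical.

```python
import numpy as np
import pandas as pd


def growth_factor(df, group_col, date_col, value_col):
    """Per-group ratio of today's new cases to the previous day's new cases."""
    out = df.sort_values([group_col, date_col]).copy()
    # New cases per day: difference of the cumulative count within each group
    new_cases = out.groupby(group_col)[value_col].diff()
    # New cases on the previous day, again within each group
    prev_new = new_cases.groupby(out[group_col]).shift(1)
    # A day that follows zero new cases has no defined ratio, so leave it as NaN
    out["growth_factor"] = new_cases / prev_new.replace(0, np.nan)
    return out


# Hypothetical cumulative counts for a single site
demo = pd.DataFrame({
    "site": ["A"] * 4,
    "Date": pd.date_range("2020-03-07", periods=4),
    "Confirmed": [10, 20, 40, 50],
})
result = growth_factor(demo, "site", "Date", "Confirmed")
print(result[["Date", "Confirmed", "growth_factor"]])
```

Under this definition, a sudden jump in reporting shows up as a growth factor well above 1 even if the underlying transmission did not change, which is why short reporting windows can be misleading.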

In [64]:
local = weather_country.copy()
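# Treat infinite derivative values as missing and zero-fill them before averaging
deriv_cols = ['1stderiv', '2ndderiv']
local[deriv_cols] = local[deriv_cols].replace(np.inf, np.nan).fillna(0)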

local = local.groupby(['Country_Region','Province_State','latlong', 'Lat', 'Long'],as_index=False)[['1stderiv','2ndderiv']].mean()
local['sderiv'] = local['2ndderiv']
p = local[local.sderiv >= 1.2]
f = local[local.sderiv <= 0.8]
t = local['sderiv'] <1.2
a = local['sderiv'] > 0.8
g = local[t & a]

# Work on an explicit copy so the column assignments below do not hit a view of `local`
p = p.copy()
p['ln(2ndderiv)'] = np.log(p['2ndderiv'] + 1)
px.set_mapbox_access_token('pk.eyJ1IjoibW9ydXRraW4iLCJhIjoiY2s1cnJhMzczMGdjaDNtcnR0M2h0NnR6cSJ9.87UtlJlwluWbZq4ioist-g')
p['sderiv'] = p['2ndderiv']*10
fig = px.scatter_mapbox(p, lat="Lat", lon="Long",     color="2ndderiv", size="2ndderiv",
                  color_continuous_scale=px.colors.cyclical.IceFire, size_max=50, zoom=3)
fig.update_layout(
    hovermode='closest',
    title_text= "Locations With An Accelerating Growth Rate (GR > 1.2) on 03/07-03/13",
    mapbox=dict(
        center=go.layout.mapbox.Center(
            lat=41.1254,
            lon=-98.2651
        ),
        pitch=0,
        zoom=3
    )
)
fig.update_layout(legend_title_text='Growth Rate')

fig.show()





<p style= "text-indent: 25px;"> <u> This means that the data is also more reliable when examined over a longer interval of time</u>, assuming that regional testing either remains at a consistent level or increases gradually. </p>	

<p style= "text-indent: 25px;"> In the <b>“World Daily Case Count over Time Graph” </b>below, the number of cases worldwide grew roughly exponentially between mid-February and late March. By April, the increase in cases appeared to taper off, consistent with the general logistic trend that the pandemic would be expected to follow. </p>
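
To make the "logistic trend" concrete, here is a minimal sketch of a logistic curve and its daily increments. The parameter values (`capacity`, `rate`, `midpoint`) are hypothetical and are not fitted to the notebook's data; the point is only that daily increments rise, peak near the midpoint, and then taper off.

```python
import math


def logistic(t, capacity, rate, midpoint):
    """Cumulative cases at day t under a logistic growth model."""
    return capacity / (1.0 + math.exp(-rate * (t - midpoint)))


# Hypothetical parameters, chosen only for illustration
CAPACITY, RATE, MIDPOINT = 1000.0, 0.2, 30

curve = [logistic(t, CAPACITY, RATE, MIDPOINT) for t in range(61)]
# Daily increments: difference between consecutive cumulative values
increments = [later - earlier for earlier, later in zip(curve, curve[1:])]

peak_day = max(range(len(increments)), key=increments.__getitem__)
print("largest daily increase occurs around day", peak_day)
```

Early in the curve the growth looks exponential, which matches the February to March behavior described above; the tapering only becomes visible once the midpoint has passed.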

In [65]:
# Worldwide totals per day
world = updated.groupby(['Date'],as_index=False)[['active cases', 'Confirmed', 'Deaths']].sum()
# Clamp negative active-case totals to zero
world['active cases'] = world['active cases'].clip(lower=0)



fig = go.Figure(data=[
    go.Bar(name='Cases', x=world['Date'], y=world['active cases']),
    go.Bar(name='Deaths', x=world['Date'], y=world['Deaths'])
])
# Change the bar mode
fig.update_layout(
    title='Daily Cases Growth',
    xaxis_title="Date",
    yaxis_title="# of Daily Cases and Deaths",
    font=dict(
        family="Courier New, monospace",
        size=14,
        color="#7f7f7f"
    )
)

fig.update_layout(barmode='overlay',
                  title = {'y':0.9,
        'x':0.5,'text':'Worldwide Daily Cases and Deaths count over time', 'xanchor': 'center', 'yanchor': 'top'})
fig.show()

fig1 = go.Figure()
fig1.add_trace(go.Scatter(x=world['Date'], y=world['active cases'],
                    mode='lines',
                    name='active cases'))


fig1.update_layout(
    title='Daily Cases Growth',
    xaxis_title="Date",
    yaxis_title="# of Daily Cases",
    font=dict(
        family="Courier New, monospace",
        size=14,
        color="#7f7f7f"
    )
)

fig1.update_layout(
    title={
        'text': 'World Daily Case Count over time',
        'y':0.9,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'}
)

fig1.show()







## A Global Overview: 

<p style= "text-indent: 25px;"> In the <b>“Coronavirus Cases Change over Time Graph”</b> below, a log scale is used to compare the number of cases across the world over time. As the number of cases increases at a given site, the marker at that site moves from blue to red. China, where the virus spread began, was the only country with a red marker in January. By the end of January, cases had been reported across East and Southeast Asia and Australia. Individual cases had also been reported in Canada, Germany and France, but none of those countries were implementing wide-scale testing. </p>

<p style= "text-indent: 25px;">  By March 1st, China remained the only location with a red marker, and orange markers appeared in South Korea, Italy, Iran and Japan. Countries transitioning to green markers included Spain, France and Germany. The majority of the sites within Europe and the Middle East were reporting cases at this time. The United States, Saudi Arabia, the African Continent (excluding Nigeria, Egypt and Algeria), and the South American Continent (excluding Brazil and Ecuador) had reported zero cases at this point in time. </p>

<p style= "text-indent: 25px;"> By March 20, the map was covered in markers — testing had gone up dramatically. The red to dark orange zones were now South Korea, China, Iran, Italy, Germany, France, Spain, and the United States. All of the areas that were not previously tracking cases were reporting case spread. By April 7th, the coverage of the map looked roughly the same, except that all of the markers had increased in size to represent the overall increase in the number of cases reported. At this point in time, the marker in China had become more orange as the country’s mitigation efforts yielded results. The United States now housed the epicenter of the pandemic. </p>
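
The map for this section colors and sizes markers by the natural log of confirmed cases plus one, and its title notes that a value of 14 corresponds to roughly 1.18 million cases. The short sketch below shows that transform and its inverse; the function names `log_cases` and `cases_from_log` are hypothetical helpers introduced only for illustration.

```python
import math


def log_cases(confirmed):
    """ln(confirmed + 1); the +1 keeps zero-case locations defined at 0."""
    return math.log1p(confirmed)


def cases_from_log(value):
    """Inverse transform: recover the case count from a log-scale value."""
    return math.expm1(value)


print(log_cases(0))            # zero cases map to 0.0
print(log_cases(1_180_000))    # about 13.98
print(cases_from_log(14))      # about 1.2 million
```

Because the scale is logarithmic, a one-unit change in marker value corresponds to a factor of about 2.7 in case counts, so small visual differences can hide large absolute differences.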


In [66]:
cases1 = updated.copy()
#cases['rh'] = cases['rh'].fillna(0)
grp4 = cases1.groupby(['Date', 'Country/Region'])[['active cases', 'Confirmed', 'Deaths', 'Recovered']].sum()

grp4['ln_ConfirmedCases'] = np.log(grp4.Confirmed + 1) 
grp4 = grp4.reset_index()
grp4['Date'] = grp4['Date'].dt.strftime('%m/%d/%Y')
grp4['Country'] =  grp4['Country/Region']

fig = px.scatter_geo(grp4, locations="Country", locationmode='country names', 
                     color="ln_ConfirmedCases",size= "ln_ConfirmedCases", hover_name="Country/Region",hover_data = [grp4.Confirmed, grp4.Deaths, grp4.Recovered ],projection="natural earth",
                     animation_frame="Date",width=900, height=700,
                        color_continuous_scale="portland",
                     title='World Map of Log Coronavirus Cases Change Over Time (14 = 1.18M)')

fig.update(layout_coloraxis_showscale=True)






<p style= "text-indent: 25px;"> Similarly, the <b>“COVID-19: Temperature By Country/Region Over Time”</b> graph below <u>shows how daily average temperatures change across the world with time</u>. South of the equator and in the tropics, countries all had dark orange or red markers in January and February, signifying average temperatures between approximately 70°F and 90°F. Moving towards March and April, the temperate regions in this area cooled to averages between 60°F and 75°F. Regions in the temperate zones north of the Equator fluctuated between 30°F and 65°F. Europe saw distinct warmer regions along its West coast compared to the rest of the continent. </p>


In [67]:
cases = weather_country.copy()
cases['rh'] = cases['rh'].fillna(0)
grp2 = cases.groupby(['Date', 'Country_Region'])[['ConfirmedCases', 'daily_case', 'Fatalities']].sum()
grp1 = cases.groupby(['Date', 'Country_Region'])[['rh', 'temp']].mean()
grp = pd.merge(grp2, grp1, on=['Country_Region', 'Date'], how='left')

grp['ln_ConfirmedCases'] = np.log(grp.ConfirmedCases + 1) 
grp = grp.reset_index()
grp['Date'] = grp['Date'].dt.strftime('%m/%d/%Y')
grp['Country'] =  grp['Country_Region']


grp['dumtemp'] = grp['temp']+50
grp['dumtemp'] = grp['dumtemp'].round(2)

fig = px.scatter_geo(grp, locations="Country", locationmode='country names', 
                     color="temp",size= "dumtemp", hover_name="Country_Region",hover_data = [grp.ConfirmedCases, grp.daily_case, grp.Fatalities ],projection="natural earth",
                     animation_frame="Date",width=900, height=700,
                    color_continuous_scale="portland", title='COVID-19: Temperature By Country/Region Over Time')

fig.update(layout_coloraxis_showscale=True)
fig.show()






<p style= "text-indent: 25px;"> The <b>“COVID-19: Relative Humidity By Country/Region Over Time”</b> graph differs significantly from the temperature graph, and is therefore useful in furthering the regional climate analysis. As of January 22nd, much of North Africa, South Africa and Namibia had relative humidities of around 20% to 30%. The rest of the world generally saw higher average relative humidities, ranging between 50% and 85%. Compared to daily average temperature, this variable changed rapidly. For example, on April 7th, several Eastern European countries reported relative humidity values between 10% and 40%, even though the values for these same countries had previously been consistently much higher. </p>


In [68]:
fig1 = px.scatter_geo(grp, locations="Country", locationmode='country names', 
                     color="rh",size= "rh", hover_name="Country_Region",hover_data = [grp.ConfirmedCases, grp.daily_case, grp.Fatalities ],projection="natural earth",
                     animation_frame="Date",width=900, height=700,
                        color_continuous_scale="portland",
                     title='COVID-19: Relative Humidity By Country/Region Over Time')

fig1.update(layout_coloraxis_showscale=True)
fig1.show()

<p style= "text-indent: 25px;">The <b>“Rate of New Cases / Density (Pop/km^2) With Temperature Color Gradient”</b> graph shows how the rate of change of case spread at specific locations relates to population density and temperature. Both axes follow a logarithmic scale, with the x-axis representing the population density and the y-axis representing the rate of cases reported. The size of each marker represents cases per million people, and the color of each marker references temperature as before. From this graph alone, it is difficult to construct an argument. The locations with the fastest rates seem to be at temperatures between 55°F and 65°F, but there is no overwhelming trend. The similarity in the size of the markers at higher rates of spread could suggest that there are case density thresholds that promote faster periods of virus spread.</p>
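
The marker size in this graph is a simple normalization of case counts by population. A minimal sketch of that arithmetic is shown below; the helper name `cases_per_million` and the example figures are hypothetical and are not taken from the data set.

```python
def cases_per_million(confirmed, population):
    """Confirmed cases scaled to a population of one million, rounded."""
    if population <= 0:
        raise ValueError("population must be positive")
    return round(confirmed / population * 1_000_000)


# Hypothetical country: 5,000 confirmed cases among 10 million residents
print(cases_per_million(5_000, 10_000_000))
```

Normalizing this way lets small, dense countries be compared with large ones, although it does not correct for differences in testing coverage.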

In [69]:
df = weather_country.copy()
df['latlongcount'] = list(zip(df.Country_Region, df.Province_State))
#ind = df['latlongcount'].to_numpy()
new = df[['Date','Country_Region','latlongcount', '1stderiv', 'Density P/km^2', 'temp', 'ConfirmedCases', 'Population']].copy()
#new['Date'] = pd.tslib.Timestamp(new['Date'])
new = new[new.Date == new.Date.max()]

new1 = new.groupby(['Country_Region'],as_index=False)[['1stderiv', 'Population', 'Density P/km^2', 'temp']].mean()
new2 = new.groupby(['Country_Region'],as_index=False)['ConfirmedCases'].sum()
new = pd.merge(new1, new2, on='Country_Region', how='left')
new['Cases_Million_People'] = round((new['ConfirmedCases'] / new['Population']) * 1000000)

#df = px.data.gapminder()
fig = px.scatter(new, x="Density P/km^2", y="1stderiv",
           size="Cases_Million_People", color="temp", hover_name="Country_Region",
           log_y=True, log_x=True, size_max=55)
fig.update_layout(title='Rate of New Cases / Density (Pop/km^2) With Temperature(F) Color Gradient')
fig.update_layout(
    title={
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'},
    font=dict(
        family="Courier New, monospace",
        size=14,
        color="#7f7f7f"
    )
)





## A Regional Breakdown: Within Europe

<p style= "text-indent: 25px;"> The impact of temperature on sites in different countries can be assessed in the context of Europe. The comparison will incorporate 3 sets of countries: <b>the United Kingdom and Spain, Belgium and Germany, and the Netherlands and France</b>. These countries have been paired based on similar temperature trends. </p>


In [70]:
c = weather_country.copy()
c['Province_State'] = c['Province_State'].fillna("Total")
#c = c.drop(columns = ['1stderiv', '2ndderiv'])
c2 = c.groupby(['Date', 'Country_Group', 'Country_Region'],as_index=False)['daily_case'].sum()
c = c.groupby(['Date', 'Country_Group', 'Country_Region'],as_index=False)['temp'].mean()
c = pd.merge(c, c2, on=['Date', 'Country_Group', 'Country_Region'], how='left')
c = c[['Date','Country_Group','Country_Region', 'daily_case', 'temp']]

europe = c[c['Country_Group']=="Europe"]
europe_n = europe[europe['Country_Region'].isin(['France', 'Germany', 'Netherlands', 'United Kingdom', 'Spain', 'Belgium'])]

fig5 = px.line(europe_n, x="Date", y="temp", color='Country_Region', title='Average Temperature (F) Change Over Time Per Country In Europe')
fig5.update_layout(
    xaxis_title="Date",
    yaxis_title="Temperature (F)",
    title={
        'y':0.9,
        'x':0.45,
        'xanchor': 'center',
        'yanchor': 'top'}
    #font=dict(
    #    family="Courier New, monospace",
    #    size=14,
    #    color="#7f7f7f"
    #)
)
fig5.show()

fig6 = px.line(europe_n, x="Date", y="daily_case", color='Country_Region', title='Average Rate Of Cases Change Over Time In Countries In Europe')
fig6.update_layout(
    xaxis_title="Date",
    yaxis_title="Rate of Daily Cases",
    title={
        'y':0.9,
        'x':0.45,
        'xanchor': 'center',
        'yanchor': 'top'}
    #font=dict(
    #    family="Courier New, monospace",
    #    size=14,
    #    color="#7f7f7f"
    #)
)
fig6.show()

### Netherlands and France
<p style= "text-indent: 25px;"> The Netherlands and France both had warmer daily average temperatures ranging from 70.94°F to 78.92°F and 65.37°F to 73.04°F, respectively. There is no overlap between their trends, but on February 4th there was a difference in temperature between the countries of about 0.5°F. The rates of cases for these two countries changed greatly with time. On March 15, the Netherlands reported a rate of 33.2 cases per day while France reported a rate of 38.7 cases per day. On March 28, the Netherlands reported a rate of 235.1 cases per day while France reported a rate of 388.9 cases per day. On April 4th, this difference reached its maximum with the Netherlands reporting a rate of 193.9 cases per day and France reporting a rate of 1,405.4 cases per day.</p> 

### Belgium and Germany
<p style= "text-indent: 25px;"> Belgium and Germany also displayed similar temperature trends, ranging from 34.60°F to 62.70°F and 27.1°F to 55.6°F, respectively. There is a high level of overlap between their temperature trends. The difference in rates for Belgium and Germany was even more pronounced than it was for the Netherlands and France. On March 15th, Belgium reported a rate of 163.5 cases per day while Germany reported a rate of 1,060 cases per day. On March 29th, Belgium reported a rate of 1,776.0 cases per day while Germany reported a rate of 5,612 cases per day. On April 4th, Belgium reported a rate of 1,541.5 cases per day while Germany reported a rate of 5,649.0 cases per day. </p>

### United Kingdom and Spain
<p style= "text-indent: 25px;"> The United Kingdom and Spain also had similar intermediate temperature ranges, from 60.91°F to 68.29°F and 52.9°F to 60.10°F, respectively. There is some overlap between their temperature ranges. Their differences in rates of cases were even more pronounced. On March 15th, the United Kingdom reported a rate of 15.6 cases per day compared to Spain’s 1,283 cases per day. On March 28th, the United Kingdom reported a rate of 250 cases per day in comparison to Spain’s 7,724.5 cases per day. On April 4th, the rate for the United Kingdom was 377.5 cases per day in comparison to Spain’s 7,051.5 cases per day. </p>


### Conclusion from comparison
<p style= "text-indent: 25px;"> <u>From these comparisons alone, it is difficult to draw a distinct correlation between temperature and case spread. </u> On the basis of the rate curves, rather than temperature, it looks like Spain should be paired with Germany, France should be paired with Belgium, and the Netherlands should be paired with the UK. </p>

<p style= "text-indent: 25px;"> While most of the pairings involve a country from the intermediate temperatures being paired with a country from an extreme temperature category, the only real anomaly is the pairing of the Netherlands with the United Kingdom. This could perhaps be the result of the implementation of certain policies. Both countries initially considered herd immunity strategies, shutting down institutions and direct contact services but leaving other services and businesses open while maintaining distancing (Holligan, 2020). The United Kingdom also experienced difficulties in obtaining functional testing kits as of April 16th, in an effort to step up testing from 10,000 people per day to 100,000 people per day (David, 2020). This may have impacted the scale of cases reported. </p>

<p style= "text-indent: 25px;"> While insufficient information might rule out a straightforward relationship for the Netherlands and the UK, a more reliable comparison might be France and Belgium. France went into a rigid lockdown early in the onset of their epidemic and had a relatively high hospital capacity (Nossiter, 2020). Belgium has also attempted to be very transparent in their case counts, reporting a higher death rate than most countries by including deaths that are presumed, rather than simply confirmed, to be the result of the coronavirus (“Belgium unveils plans to lift lockdown”, 2020). The differences in temperatures might be too wide to draw any significant conclusions. </p>
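
One way to put a number on "it is difficult to draw a distinct correlation" is a Pearson correlation coefficient between a country's daily temperature and its daily case rate. The sketch below is a plain-Python version with made-up series; the `pearson` helper and the sample values are hypothetical and are not results from this study.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return float("nan")  # a constant series has no defined correlation
    return cov / (spread_x * spread_y)


# Made-up daily temperatures (F) and case rates for one hypothetical country
temps = [41.0, 44.5, 47.2, 50.1, 53.8]
rates = [12.0, 30.0, 85.0, 160.0, 240.0]
print(round(pearson(temps, rates), 3))
```

With the notebook's `europe_n` frame, a comparable per-country figure could likely be obtained with something like `europe_n.groupby('Country_Region').apply(lambda g: g['temp'].corr(g['daily_case']))`. A high coefficient here would still not separate the effect of temperature from the passage of time, since both temperature and cases rise over the same weeks.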


## A Regional Breakdown: The United States, Europe and East Asia



In [90]:
updated_groups = updated.groupby(['Date', 'country_group'],as_index=False)[['active cases', 'Confirmed']].sum()
updated_groups['active cases'] = updated_groups['active cases'].clip(lower=0)
updated_groups = updated_groups.groupby('country_group').resample('W-Mon', on='Date').sum().reset_index().sort_values(by='Date')
updated_groupsl = updated_groups.copy()

wd1 = weather_country.groupby(['Date', 'Country_Group'],as_index=False)[['daily_case', 'ConfirmedCases']].sum()
wd1['daily_case'] = wd1['daily_case'].clip(lower=0)
wd1 = wd1.groupby('Country_Group').resample('W-Mon', on='Date').sum().reset_index().sort_values(by='Date')
wd2 = wd1.copy()

#added mean temp 
a = weather_country.copy()
updated_groups3 = a.groupby(['Date', 'Country_Group'],as_index=False)['temp'].mean()
updated_groups3 = updated_groups3.rename(columns={"Country_Group": "country_group"})
updated_groups = pd.merge(updated_groups, updated_groups3, on=['country_group','Date'], how='left')
# only keep the data points where temp is recorded
updated_groups = updated_groups.dropna(subset=['temp'])

#adding derivatives for the regions UPDATEDGROUPS
second = get2deriv(updated_groupsl, 'Date', 'country_group', "active cases")
updated_groupsl = pd.merge(updated_groupsl, second, on=['country_group','Date'], how='left')

first = get1deriv(updated_groupsl, 'Date', 'country_group', "active cases")
updated_groupsl = pd.merge(updated_groupsl, first, on=['country_group','Date'], how='left')

updated_groupsl["2ndderiv"] = updated_groupsl["2ndderiv"].fillna(1.0)
updated_groupsl["1stderiv"] = updated_groupsl["1stderiv"].fillna(0.0)

#adding derivatives for the regions WEATHER-COUNTRY
second = get2deriv(wd2, 'Date', 'Country_Group', "daily_case")
wd2 = pd.merge(wd2, second, on=['Country_Group','Date'], how='left')

first = get1deriv(wd2, 'Date', 'Country_Group', "daily_case")
wd2 = pd.merge(wd2, first, on=['Country_Group','Date'], how='left')

wd2["2ndderiv"] = wd2["2ndderiv"].fillna(1.0)
wd2["1stderiv"] = wd2["1stderiv"].fillna(0.0)

updated_groupsl = updated_groupsl.loc[updated_groupsl['Date'] <='2020-05-11']



##########using weather data ###############
wupdated = weather_country.copy()
wupdated_groups1 = wupdated.groupby(['Date', 'Country_Group'],as_index=False)[['daily_case', 'ConfirmedCases']].sum()
wupdated_groups2 = wupdated.groupby(['Date', 'Country_Group'],as_index=False)[['rh', 'temp']].mean()
wupdated_groups = pd.merge(wupdated_groups1, wupdated_groups2, on=['Country_Group','Date'], how='left')
wupdated_groups = wupdated_groups.groupby('Country_Group').resample('W-Mon', on='Date').sum().reset_index().sort_values(by='Date')

second = get2deriv(wupdated_groups, 'Date', 'Country_Group', 'ConfirmedCases')
wupdated_groups = pd.merge(wupdated_groups, second, on=['Date', 'Country_Group'], how='left')

first = get1deriv(wupdated_groups, 'Date', 'Country_Group', 'daily_case')
wupdated_groups = pd.merge(wupdated_groups, first, on=['Date', 'Country_Group'], how='left')

wupdated_groups["2ndderiv"] = wupdated_groups["2ndderiv"].fillna(1.0)
wupdated_groups["1stderiv"] = wupdated_groups["1stderiv"].fillna(0.0)





In [92]:
fig = px.line(updated_groupsl, x="Date", y="active cases", color='country_group')



fig.update_layout(
    title='Regional Daily Cases over time until 05/11',
    xaxis_title="Date",
    yaxis_title="# of Daily Cases",
    font=dict(
        family="Courier New, monospace",
        size=14,
        color="#7f7f7f"
    )
)

fig.show()

In [96]:
fig = px.line(updated_groupsl, x="Date", y="1stderiv", color='country_group')



fig.update_layout(
    title='Regional Daily Rate Of Change over time until 5/11',
    xaxis_title="Date",
    yaxis_title="# of Daily Cases Rate",
    font=dict(
        family="Courier New, monospace",
        size=14,
        color="#7f7f7f"
    )
)

fig.show()

In [97]:
ca = weather_country.copy()
ca['Province_State'] = ca['Province_State'].fillna("Total")
#c = c.drop(columns = ['1stderiv', '2ndderiv'])
cb = ca.groupby(['Date', 'Country_Group'],as_index=False)['daily_case'].sum()
ca = ca.groupby(['Date', 'Country_Group'],as_index=False)['temp'].mean()
ca = pd.merge(ca, cb, on=['Date', 'Country_Group'], how='left')

fig1 = px.line(ca, x="Date", y="temp", color='Country_Group', title='Average Temperature Change Over Time Per Region until 4/11')
fig1.update_layout(
    xaxis_title="Date",
    yaxis_title="Temperature (F)",
    font=dict(
        family="Courier New, monospace",
        size=14,
        color="#7f7f7f"
    )
)

fig1.show()

#fig2 = px.line(ca, x="Date", y="daily_case", color='Country_Group', title='Average Temperature Change Over Time Per Region')
#fig2.show()


<p style= "text-indent: 25px;"> Temperature and humidity are likely more informative at a larger, subcontinental scale. For an analysis at this level, satellite points from the United States, Europe and East Asia are compared. The <b>Average Rate of Cases with respect to Temperature</b> graph below shows that the majority of cases fall within the range of 40°F to 70°F, a wider range than those used for the comparison within Europe. The majority of the satellite points from Europe actually fall outside of this window, on either side of 80°F. The European rates within the window range from about 0.4 to 9.0 cases per day. East Asia sees similar rates within the 40°F to 70°F window, ranging from about 0.05 to 21.4 cases per day. </p>

<p style= "text-indent: 25px;"> The United States differs greatly in this same window, with rates ranging from about 11.8 to about 9,900 cases per day. This is notable because mitigation procedures began around early to mid-March for both Europe and the United States, whereas some East Asian nations had been implementing mitigation practices since January (Gan, 2020). While European countries mainly responded with federally mandated lockdowns, the variation in policies across the United States and the cost of access to healthcare were likely major contributors to the variation in rates (IMF, 2020). Based on this graph, the temperature difference could be a factor as well. </p>
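
The share of satellite points that fall inside a temperature window such as 40°F to 70°F can be computed per region with a short pandas sketch like the one below. The `share_in_window` helper and the demo values are hypothetical; only the column names `temp` and `Country_Group` follow the notebook.

```python
import pandas as pd


def share_in_window(df, value_col, group_col, low, high):
    """Fraction of rows per group whose value lies in [low, high]."""
    inside = df[value_col].between(low, high)
    # The mean of a boolean column is the fraction of True values
    return inside.groupby(df[group_col]).mean()


# Made-up site temperatures in degrees Fahrenheit
demo = pd.DataFrame({
    "Country_Group": ["US", "US", "US", "Europe", "Europe"],
    "temp": [45.0, 55.0, 75.0, 82.0, 60.0],
})
shares = share_in_window(demo, "temp", "Country_Group", 40, 70)
print(shares)
```

Reporting these shares alongside the scatter plot would make statements such as "the majority of points fall outside this window" checkable rather than visual impressions.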


In [123]:
df = weather_country.copy()
df['latlongcount'] = df['Country_Region'] + ',' + df['Province_State']
#ind = df['latlongcount'].to_numpy()
new = df[['Date','latlongcount', 'Country_Group','lag_1stderiv', 'lag','Density P/km^2', 'temp', 'ConfirmedCases', 'Population', 'rh']].copy()
#new['Date'] = pd.tslib.Timestamp(new['Date'])
new = new[new.Date > '2020-04-30']
new = new[new.Date < '2020-05-05']

new1 = new.groupby(['latlongcount', 'Country_Group'],as_index=False)[['lag_1stderiv', 'lag','Population', 'Density P/km^2', 'temp', 'rh']].mean()
new2 = new.groupby(['latlongcount'],as_index=False)['ConfirmedCases'].max()
new = pd.merge(new1, new2, on='latlongcount', how='left')
new['Cases_Million_People'] = round((new['ConfirmedCases'] / new['Population']) * 1000000)

#df = px.data.gapminder()
fig = px.scatter(new, x="temp", y="lag_1stderiv",
            color="Country_Group", hover_name="latlongcount",
           log_y=True,log_x=True, size_max=55)
fig.update_layout(title='4/30-5/05 Average Rate of New Cases / Temperature(F)')
fig.update_layout(
    xaxis_title="Temperature (F)",
    yaxis_title="Case Rate",
    font=dict(
        family="Courier New, monospace",
        size=14,
        color="#7f7f7f"
    )
)

fig.show()





<p style= "text-indent: 25px;"> <b>The Average Rate of Cases versus Relative Humidity Graph</b> below gives a slightly different perspective. Most of Europe’s cases fall between 70% and 80% relative humidity, whereas most of the cases in the US fall between 60% and 70%, with some between 50% and 60%. Because the US rates are also the higher of the two, this suggests that drier climates might have higher transmission rates for the virus. East Asia has lower rates of case spread overall, but there are no distinct groupings by relative humidity within this category. </p>


In [124]:
fig1 = px.scatter(new, x="rh", y="lag_1stderiv",
            color="Country_Group", hover_name="latlongcount",
           log_y=True,log_x=True, size_max=55)
fig1.update_layout(title='4/30-5/05 Average Rate of New Cases / Relative Humidity')
fig1.update_layout(
    xaxis_title="Relative Humidity",
    yaxis_title="Case Rate",
    font=dict(
        family="Courier New, monospace",
        size=14,
        color="#7f7f7f"
    )
)

fig1.show()

<p style= "text-indent: 25px;"> The graph below showing <b> the average rate of new cases vs. temperature vs. relative humidity </b> could be another indicator that temperatures between 40°F and 70°F and relative humidity values between 60% and 80% might be optimal for the spread of the virus. When viewing the graph with temperature as the primary x-axis, a somewhat parabolic shape is observed, with a peak around 50°F. When viewing the graph with relative humidity as the primary x-axis, the majority of the parabolic arrangement of satellite points fits within the 60% to 80% relative humidity window. </p>
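
The "somewhat parabolic shape" with a peak near 50°F is a visual reading of the plot. A minimal sketch of how such a peak could be estimated is to fit a quadratic with `numpy.polyfit` and take its vertex. The data below are synthetic and constructed to peak at 50, so this illustrates the technique only and is not a result; `fit_parabola_peak` is a hypothetical helper.

```python
import numpy as np


def fit_parabola_peak(x, y):
    """Fit y = a*x^2 + b*x + c and return the x position of the maximum.

    Returns None when the fitted parabola opens upward (no maximum).
    """
    a, b, _c = np.polyfit(x, y, deg=2)
    if a >= 0:
        return None
    return -b / (2 * a)


# Synthetic temperatures (F) and log case rates built to peak at 50 F
temps = np.array([30.0, 40.0, 50.0, 60.0, 70.0])
log_rates = -((temps - 50.0) ** 2) / 100.0 + 3.0

peak = fit_parabola_peak(temps, log_rates)
print("estimated peak temperature:", peak)
```

Applied to real data, the fit would be run on the log of the case rate, and the spread of points around the curve would indicate how much weight the estimated peak deserves.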

In [125]:
fig = px.scatter_3d(new, x='rh', y='temp', z='ConfirmedCases',
              color='Country_Group', log_z=True)
fig.update_layout(title='4/3-4/11 Average Rate of New Cases / Temp (C) / Relative Humidity')
fig.update_layout(
    font=dict(
        family="Courier New, monospace",
        size=14,
        color="#7f7f7f"
    )
)

fig.show()

## Data Projection: 

<p style= "text-indent: 25px;"> Based on the data collected above, a projection was run for the predicted cases per location based on a number of climate and population determinants, including temperature and relative humidity. These models were chosen for their robustness to outliers and predictive accuracy on log-transformed data. </p>
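Because the models are evaluated on log-transformed case counts, a short sketch of the transform may help. The notebook code applies `np.log1p` to the targets; the example below uses the equivalent functions from Python's standard `math` module, and the sample counts are illustrative only.

```python
import math

# Illustrative case counts (not taken from the data set).
counts = [0, 36, 1000, 250000]

# log1p(x) = ln(1 + x): it is defined at zero and compresses large counts.
logged = [math.log1p(c) for c in counts]

# expm1 inverts the transform, recovering the original counts.
recovered = [math.expm1(v) for v in logged]

for c, v, r in zip(counts, logged, recovered):
    print(f"count={c:>7}  log1p={v:6.3f}  recovered={r:10.1f}")

# An absolute error of d on the log scale corresponds to being off by a
# factor of exp(d) in (1 + cases); for d = 1 that factor is e.
factor_for_unit_error = math.exp(1.0)
```

One consequence is that error metrics reported on this scale describe multiplicative rather than additive differences in case counts.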

### The models used for prediction include:
* **Linear Regression** 
    * <u>metrics:</u> 
        * Mean Absolute Error (MAE)
        * <u>reason:</u> more robust to outliers 
        * R Squared (R^2)
        * <u>reason:</u> shows how well our line fits the data 
* **Random Forest** 
    * <u>metrics:</u> 
        * Mean Squared Error (MSE)
        * <u>reason:</u> need to highlight variance 
        * R Squared (R^2)
        * <u>reason:</u> shows how well our model fits the data 

* <u>NOTE:</u> The <b>R^2 value</b> can be reported on a couple of different scales; the one used here ranges from negative infinity to 1. A negative score means the model fits the test data worse than a constant prediction equal to the mean of the observed values, so it captures none of the variance in the data. 
    * the equation is: ```1 - (residual sum of squares / total sum of squares)```
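To make the metrics listed above concrete, here is a minimal pure-Python sketch that computes MAE, MSE, and R^2 by hand. The function names `mae`, `mse`, and `r_squared` and all of the numbers are illustrative assumptions, not part of the original analysis.

```python
def mae(y_true, y_pred):
    """Mean absolute error: the average size of the errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error: squaring makes large errors dominate."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]          # illustrative values only
close_fit = [1.1, 1.9, 3.2, 3.8]       # small errors everywhere
reversed_fit = [4.0, 3.0, 2.0, 1.0]    # trend in the wrong direction
one_outlier = [1.0, 2.0, 3.0, 14.0]    # perfect except one large miss

print(r_squared(y_true, close_fit))     # close to 1
print(r_squared(y_true, reversed_fit))  # negative: worse than predicting the mean
print(mae(y_true, one_outlier), mse(y_true, one_outlier))
```

In the last line the single large miss raises the MSE (25.0) ten times as much as the MAE (2.5), which is the sense in which MAE is the more outlier-robust of the two.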

### The variables used for the predictive models include:
* Population
* Wind Speed daily 
* Precipitation daily 
* Fog daily 
* Average Temperature (C) daily 
* Min Temperature (C) daily 
* Max Temperature (C) daily 
* Temperature Variance (C) daily 
* Relative Humidity 
* Density Pop/km^2 per Country 
* Median Age per Country 
* Urban Pop % per Country 

### Training and Test Set Length 
* <u>train set</u>: 01/22/2020 - 03/22/2020 (80% of the data) 8-week training 
* <u>test set</u>: 03/23/2020 - 04/11/2020 (20% of the data) 3-week prediction 
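A time-based split like the one listed above can be sketched as follows. This is a toy example under stated assumptions: the `toy` frame and its values are made up, and the cutoff is derived from the 03/22/2020 date using the day-of-year convention that `day_from_jan_first` appears to follow in the data (2020-05-12 is day 133).

```python
import pandas as pd

# Toy data standing in for the merged data set (values are illustrative).
toy = pd.DataFrame({
    "day_from_jan_first": [22, 50, 82, 83, 90, 102],
    "ConfirmedCases": [1, 10, 100, 120, 300, 900],
})

# Last training day expressed as a day of year (1 January = day 1).
cutoff_day = pd.Timestamp("2020-03-22").dayofyear

# Complementary conditions guarantee that every row lands in exactly one set.
train = toy[toy["day_from_jan_first"] <= cutoff_day]
test = toy[toy["day_from_jan_first"] > cutoff_day]

print(cutoff_day, len(train), len(test))
```

Using `<=` and `>` on the same cutoff avoids silently dropping the rows that fall on a boundary day.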

In [126]:
weather_country.tail(3)

Unnamed: 0,Id,Province_State,Country_Region,Date,ConfirmedCases,Fatalities,country_province,Lat,Long,day_from_jan_first,...,Urban Pop %,Cases_Million_People,ln(Cases / Million People),Country_Group,2ndderiv,daily_case,1stderiv,lag,lag_death,lag_1stderiv
35451,35680,,Zimbabwe,2020-05-12,36.0,4.0,Zimbabwe-,-17.829167,31.052222,133,...,0.38,2.0,1.098612,Rest Of World,1.0,0.33,0.0,0.0,0.0,0.0
35452,35681,,Zimbabwe,2020-05-13,37.0,4.0,Zimbabwe-,-17.829167,31.052222,134,...,0.38,2.0,1.098612,Rest Of World,1.0,0.33,-0.11,0.0,0.0,0.0
35453,35682,,Zimbabwe,2020-05-14,37.0,4.0,Zimbabwe-,-17.829167,31.052222,135,...,0.38,2.0,1.098612,Rest Of World,1.0,0.33,-0.11,0.0,0.0,0.0


### Predictor #1: Linear Regression 

In [143]:
from sklearn import linear_model
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import accuracy_score
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_log_error
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error

def lin_reg(X_train, Y_train, X_test, regr):
    """Fit the regressor `regr` on the training set and predict on the test set."""
    # Train the model using the training sets
    regr.fit(X_train, Y_train)

    # Make predictions using the testing set
    y_pred = regr.predict(X_test)

    return regr, y_pred


In [144]:
data = weather_country.copy()
# Cast the case and fatality counts to float (the log transform is applied to the target below)
data['ConfirmedCases'] = data['ConfirmedCases'].astype('float64')
data['Fatalities'] = data['Fatalities'].astype('float64')

# Replace infinites
data.replace([np.inf, -np.inf], 0, inplace=True)
data = data.fillna(0)
features = ['Country_Region','Province_State','ConfirmedCases','Population','wdsp', 'prcp','fog','temp', 'rh', 'min', 'max','temp variance','Density P/km^2',
       'day_from_jan_first','Med. Age','Urban Pop %' ]
data = data[features]

# Time-based split on day of year. Row order does not affect these masks,
# so no sort is needed. Note that day 110 falls in neither set.
train = data.loc[data['day_from_jan_first'] <=109]
test = data.loc[data['day_from_jan_first'] > 110]

trainy = train[['ConfirmedCases']].copy()
trainy_nolog = trainy.copy()
trainy = trainy.apply(lambda x: np.log1p(x))

trainx = train.drop(columns = ['Country_Region', 'Province_State','ConfirmedCases']).copy()

testy = test[['ConfirmedCases']].copy()
testy_nolog = test[['ConfirmedCases']].copy()
testy = testy.apply(lambda x: np.log1p(x))

testx = test.drop(columns = ['Country_Region', 'Province_State','ConfirmedCases']).copy()



regr = linear_model.LinearRegression()
r, ypred= lin_reg(trainx, trainy, testx, regr)

r2 = r2_score(testy, ypred)
mae = mean_absolute_error(testy, ypred)  # computed on the log1p scale
print("R squared Value (Variability Explained)")
print(r2)
print(' ')
print("Mean Absolute Error Value (Variance)")
print(mae)




R squared Value (Variability Explained)
-0.08580347625673723
 
Mean Absolute Error Value (Variance)
2.1799030771167223


In [146]:
s = test.copy()
s['predicted'] = ypred
s['real'] = testy
a = s.groupby(['day_from_jan_first'],as_index=False)[['predicted', 'real']].sum()
#s.head(10)

fig = go.Figure()
fig.add_trace(go.Scatter(x=a['day_from_jan_first'], y=a['real'],
                    mode='lines',
                    name='Real Confirmed Cases'))
fig.add_trace(go.Scatter(x=a['day_from_jan_first'], y=a['predicted'],
                    mode='lines',
                    name='Predicted Confirmed Cases' ))

fig.update_layout(title='Regression Predictive Model: 04/20/2020 - 05/15/2020')


fig.show()





### Predictor #2: Random Forest 

In [147]:

rfcla = RandomForestRegressor(n_estimators=100, max_samples=0.8,
                        random_state=1)
# Train the model; ravel() passes y as a 1-D array, which scikit-learn expects
rfcla.fit(trainx, trainy.values.ravel())
predictions = rfcla.predict(testx)

fi = pd.DataFrame({'feature': list(trainx.columns),
                   'importance': rfcla.feature_importances_}).\
                    sort_values('importance', ascending = False)

r2 = r2_score(testy, predictions)
print("R squared Value (Variability Explained)")
print(r2)
print(' ')
print("Mean Squared Error Value (Variance):")
mse = mean_squared_error(testy, predictions)
print(mse)
fi.head()





R squared Value (Variability Explained)
0.6304775929274251
 
Mean Squared Error Value (Variance):
2.4456304051361033


Unnamed: 0,feature,importance
10,day_from_jan_first,0.548336
0,Population,0.194605
11,Med. Age,0.064649
4,temp,0.059219
9,Density P/km^2,0.030074


### Predictive Models Conclusion 

<p style= "text-indent: 25px;"> When looking at the linear regression model, we see an R^2 value of ```-0.09```, which means that the regression line explains none of the variability in the test data, even though the mean absolute error (about ```2.18``` on the log scale) is relatively low. However, when we fit a Random Forest ensemble, which averages the results of 100 decision regression trees, we start to see some correlation between the variables and the results. The R^2 value for the Random Forest Regressor (RF) is ```0.63```, which is considerably higher than that of the linear regression. This increase in accuracy may be because RF is more robust to outliers and can capture non-linear relationships, whereas linear regression can only capture linear ones. </p>

<p style= "text-indent: 25px;"> The feature importance table above shows the top 5 variables used by RF in making the prediction. We see here that the ```date``` is the most important factor by far, followed by ```population```, ```median age per country```, ```average daily temperature```, and ```density per country```. </p>

<b>Result:</b> Climate importance in prediction cannot be seen through the data in this format.
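
The contrast drawn above between linear regression and Random Forest on non-linear data can be illustrated on synthetic data. The sketch below is not the notebook's model: the parabolic data, the noise level, and the 300/100 split are assumptions chosen only to show the effect.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic, deliberately non-linear relationship: y = x^2 + noise.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=400)
y = x ** 2 + rng.normal(0, 0.3, size=400)
X = x.reshape(-1, 1)

# Simple hold-out split: first 300 points for training, last 100 for testing.
X_train, X_test = X[:300], X[300:]
y_train, y_test = y[:300], y[300:]

lin = LinearRegression().fit(X_train, y_train)
rf = RandomForestRegressor(n_estimators=100, random_state=1).fit(X_train, y_train)

r2_lin = r2_score(y_test, lin.predict(X_test))
r2_rf = r2_score(y_test, rf.predict(X_test))
print(f"linear regression R^2: {r2_lin:.2f}")
print(f"random forest R^2:     {r2_rf:.2f}")
```

A straight line cannot follow a symmetric parabola, so its R^2 stays near zero, while the tree ensemble approximates the curve piecewise and scores much higher.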



### Prediction Take 2: Prediction With Lag Time 
<p style= "text-indent: 25px;"> <u>Something that we did not consider when comparing climate determinants and cases was the lag in response to the climate determinants.</u> Many countries, including the U.S., test people when they start showing symptoms of the virus, such as a dry cough and high fever. These symptoms appear, on average, after around 6-7 days. We can therefore assume that people who caught the virus under a specific climate condition would not show symptoms until around a week has passed. With this in mind, we ran the predictive models again with a 7-day lag. </p>
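
One way such a lag could be built is sketched below. How the notebook's `lag` column was actually computed is not shown in this section, so this is an assumption: the toy frame, the `location` column, and the `LAG_DAYS` constant are hypothetical, and the idea is simply to pair each day's conditions with the cases reported seven days later at the same location.

```python
import pandas as pd

LAG_DAYS = 7  # assumed lag between exposure and a confirmed case

# Toy data: two locations, ten consecutive days each (illustrative values).
toy = pd.DataFrame({
    "location": ["A"] * 10 + ["B"] * 10,
    "day_from_jan_first": list(range(1, 11)) * 2,
    "ConfirmedCases": list(range(0, 100, 10)) + list(range(0, 50, 5)),
})

# Sort so that shifting moves along time within each location.
toy = toy.sort_values(["location", "day_from_jan_first"])

# For each row, look LAG_DAYS ahead within the same location.
toy["lag"] = toy.groupby("location")["ConfirmedCases"].shift(-LAG_DAYS)

print(toy[toy["location"] == "A"])
```

The last `LAG_DAYS` rows of each location have no future value and come out as `NaN`; they would need to be dropped or filled before modelling.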

### Predictor #1: Linear Regression With Lag 


In [155]:
data1 = weather_country.copy()
# Cast the lagged case and fatality columns to float, then apply a log1p transformation
data1['lag']  = data1['lag'].astype('float64')
data1['lag_death'] = data1['lag_death'].astype('float64')
data1['lag'] = data1['lag'].apply(lambda x: np.log1p(x))
data1['lag_death'] = data1['lag_death'].apply(lambda x: np.log1p(x))

# Replace infinites
data1.replace([np.inf, -np.inf], 0, inplace=True)
data1 = data1.fillna(0)
features1 = ['Country_Region','Province_State','lag','Population','wdsp','prcp','fog','temp', 'rh', 'min', 'max','temp variance','Density P/km^2',
       'day_from_jan_first','Med. Age','Urban Pop %' ]
data1 = data1[features1]


# Time-based split on day of year; day 105 falls in neither set,
# and the test window stops before day 120.
train1 = data1[data1['day_from_jan_first'] <=104]
test1 = data1[(data1['day_from_jan_first'] > 105) & (data1['day_from_jan_first'] < 120)]
trainy1 = train1[['lag']].copy()
trainx1 = train1.drop(columns = ['Country_Region', 'Province_State','lag']).copy()
testy1 = test1[['lag']].copy()
testx1 = test1.drop(columns = ['Country_Region', 'Province_State','lag']).copy()

In [156]:
regr = linear_model.LinearRegression()
# Reuse lin_reg (defined in the first linear regression cell) on the lagged data
r1, ypred1= lin_reg(trainx1, trainy1, testx1, regr)

r21 = r2_score(testy1, ypred1)
mae1 = mean_absolute_error(testy1, ypred1)  # computed on the log1p scale
print("R squared Value")
print(r21)
print("     ")
print("Mean Absolute Error Value")
print(mae1)

R squared Value
-0.08098897009679362
     
Mean Absolute Error Value
2.155050568489572


In [157]:
k = test1.copy()


k['predicted'] = ypred1
k['real'] = testy1
d = k.groupby(['day_from_jan_first'],as_index=False)[['predicted', 'real']].sum()
#s.head(10)

fig = go.Figure()
fig.add_trace(go.Scatter(x=d['day_from_jan_first'], y=d['real'],
                    mode='lines',
                    name='Real Confirmed Cases'))
fig.add_trace(go.Scatter(x=d['day_from_jan_first'], y=d['predicted'],
                    mode='lines',
                    name='Predicted Confirmed Cases' ))

fig.update_layout(title='Regression Predictive Model WITH 7 day Lag')


fig.show()





### Predictor #2: Random Forest With Lag 

In [158]:
rfcla1 = RandomForestRegressor(n_estimators=100, max_samples=0.8,
                        random_state=1)
# Train the model; ravel() passes y as a 1-D array, which scikit-learn expects
rfcla1.fit(trainx1, trainy1.values.ravel())
predictions1 = rfcla1.predict(testx1)

fi1 = pd.DataFrame({'feature': list(trainx1.columns),
                   'importance': rfcla1.feature_importances_}).\
                    sort_values('importance', ascending = False)

r2 = r2_score(testy1, predictions1)
print("R squared Value (Variability Explained)")
print(r2)
print(' ')
print("Mean Squared Error Value (Variance):")
mse = mean_squared_error(testy1, predictions1)
print(mse)
# NOTE: this displays `fi`, the importances of the no-lag model fitted earlier;
# use fi1.head() to display the importances of the lag model instead.
fi.head()





R squared Value (Variability Explained)
0.7216407764397945
 
Mean Squared Error Value (Variance):
1.8123094986725587


Unnamed: 0,feature,importance
10,day_from_jan_first,0.548336
0,Population,0.194605
11,Med. Age,0.064649
4,temp,0.059219
9,Density P/km^2,0.030074


In [159]:
k['rf'] = predictions1
c = k.groupby(['day_from_jan_first'],as_index=False)[['real', 'rf']].sum()
fig = go.Figure()
fig.add_trace(go.Scatter(x=c['day_from_jan_first'], y=c['real'],
                    mode='lines',
                    name='Real Confirmed Cases'))
fig.add_trace(go.Scatter(x=c['day_from_jan_first'], y=c['rf'],
                    mode='lines',
                    name='Predicted Confirmed Cases' ))
fig.update_layout(title='Random Forest Predictive Model WITH 7 day Lag')




fig.show()






### Predictive Models Conclusion With Lag: 

<p style= "text-indent: 25px;"> When looking at the linear regression model with lag, we see an R^2 value of ```-0.08``` and a mean absolute error of about ```2.16```, both essentially unchanged from the model without lag (```-0.09``` and ```2.18```), so the regression line still explains none of the variability. In addition, when we fit a Random Forest ensemble, we see a clearer improvement in the correlation between the variables and the results. The R^2 value for the Random Forest Regressor (RF) is ```0.72```, which says that about 72% of the variability in the data can be explained by this random forest model, compared with ```0.63``` without lag, and the mean squared error falls from ```2.45``` to ```1.81```. Again, the difference between the two models may be because RF is more robust to outliers and can capture non-linear relationships, whereas linear regression can only capture linear ones. </p>

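<p style= "text-indent: 25px;"> The feature importance table above shows the top 5 variables used by RF in making the prediction. Note that the table displayed is the output of ```fi.head()```, the importances from the model without lag, which is why it is identical to the earlier table; the importances of the lag model (```fi1```) would need to be displayed to confirm that the important variables do not change much with lag. </p>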

<b>Result:</b> Predictive power improves with lag time, most clearly in the Random Forest model.

## Conclusions and Policy Implementation:


In [154]:
fig = px.scatter_3d(new, x='rh', y='temp', z='lag',
              color='Country_Group', log_z=True)
fig.update_layout(title='4/3-4/11 Average Rate of New Cases With Lag/ Temp (C) / Relative Humidity')
fig.update_layout(
    font=dict(
        family="Courier New, monospace",
        size=14,
        color="#7f7f7f"
    ),
    scene = dict(
                    xaxis_title='Rel. Humidity',
                    yaxis_title='Temp. (C)',
                    zaxis_title='Cases Rate w/ Lag')
)

fig.show()

<p style= "text-indent: 25px;"> <b>The data suggests that cooler, slightly humid regions promote more rapid case spread at the subcontinent scale. </b> This study would benefit from additionally analyzing trends in human susceptibility to the virus as well as the seasonality of the virus, should that data become available. </p>

<p style= "text-indent: 25px;"> Additionally, average temperatures were used instead of daily variances in temperature. Susceptibility to the virus could also be a result of exposure to a range of temperatures in a given day. </p>

<p style= "text-indent: 25px;"> Going forward, policymakers should note these temperature and relative humidity values as nations start to relax social distancing guidelines. The temperate spring and fall seasons are likely to see higher rates of case spread. A humid but cool summer evening might enable faster case spread than a hot, humid summer afternoon. Understanding broad trends for when human mobility is relatively safe will aid both policymakers and the general public in combating the pandemic. </p>

## References 


<a href ><img src="https://i.ibb.co/mTtqxP5/Screenshot-2020-05-05-20-33-31.png" alt="Screenshot-2020-05-05-20-33-31" border="0"></a>
<a href><img src="https://i.ibb.co/0BNcnGT/Screenshot-2020-05-05-20-33-54.png" alt="Screenshot-2020-05-05-20-33-54" border="0"></a>
<a href><img src="https://i.ibb.co/J7TTmsx/Screenshot-2020-05-05-20-34-18.png" alt="Screenshot-2020-05-05-20-34-18" border="0"></a>