# Analysing COVID-19 Data

# Introduction

In this portfolio, we explore the global number of confirmed COVID-19 cases.

Main goals
* Comparing some countries based on their total confirmed COVID-19 cases
 - This shows the total number of confirmed cases in each country.
* Comparing some countries after normalisation
 - Larger countries have much larger populations than smaller ones, so we account for population by calculating confirmed cases per 10 people.
* Build a model for prediction
  - ARIMA (statistical method for prediction based on historical time series) and LSTM (neural network for time series data prediction).



In [None]:
!pip install pmdarima



In [None]:
## importing libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import os
import torch
from tqdm import tqdm
from pylab import rcParams
import matplotlib.dates as mdates
from matplotlib import rc
from sklearn.preprocessing import MinMaxScaler
from pandas.plotting import register_matplotlib_converters
from torch import nn, optim
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.arima.model import ARIMA
from datetime import datetime  # `from pandas import datetime` no longer works in recent pandas releases
from pmdarima.arima import auto_arima

%matplotlib inline
%config InlineBackend.figure_format='retina'

sns.set(style='whitegrid', palette='muted', font_scale=1.2)

HAPPY_COLORS_PALETTE = ["#01BEFE", "#FFDD00", "#FF7D00", "#FF006D", "#93D30C", "#8F00FF"]

sns.set_palette(sns.color_palette(HAPPY_COLORS_PALETTE))

register_matplotlib_converters()

RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)

## Dataset for Covid

Johns Hopkins University provides an open dataset that includes data on confirmed cases, deaths, and recovered cases. However, our focus is solely on the confirmed cases. The data is structured with one row per region, featuring columns for Latitude and Longitude, followed by separate columns for each day's data. This dataset has been collected since January 22, 2020, and is continuously updated.
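
The layout described above (one row per region, then one column per day) can be illustrated with a small sketch. The table below is made up and only mimics the column structure; the region names and counts are hypothetical, and the `m/d/yy` date format is assumed from the real file's column labels.

```python
import pandas as pd

# Toy table mimicking the layout described above (all values are made up).
toy = pd.DataFrame({
    "Province/State": ["Region A", "Region B", None],
    "Country/Region": ["Country X", "Country X", "Country Y"],
    "Lat": [10.0, 12.0, 50.0],
    "Long": [20.0, 22.0, 5.0],
    "1/22/20": [1, 0, 2],
    "1/23/20": [3, 1, 2],
})

# Keep only the date columns and sum the regions of each country.
date_cols = list(toy.columns[4:])
by_country = toy.groupby("Country/Region")[date_cols].sum()

# Parse the m/d/yy column labels into real dates.
by_country.columns = pd.to_datetime(by_country.columns, format="%m/%d/%y")
print(by_country)
```

Selecting the date columns before summing keeps text columns such as `Province/State` out of the aggregation.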


In [None]:
# Load the live dataset
covid_data_url = 'https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv'
covid = pd.read_csv(covid_data_url)
covid

In [None]:
## Group the dataset by country, dropping columns that are not needed for the analysis
## (text columns such as 'Province/State' are removed before summing)
grouped = covid.drop(columns=['Province/State', 'Lat', 'Long']).groupby('Country/Region').sum()
grouped

## Comparing countries

* Comparing some random countries

In [None]:
## Produce a plot for "Australia", "US", "China", "Korea" and "United Kingdom"
aa = grouped.loc[["Australia","US","China","Korea, South", "United Kingdom"]]
aa.T.plot()

* Top 10 countries

In [None]:
## Find top 10 countries
top = grouped.iloc[:,-1]
topp = top.nlargest(10)
display(topp)

In [None]:
## Plot for top 10 countries
toppp = grouped.loc[["US","India","France", "Germany", "Brazil", "Japan", "Korea, South", "Italy","United Kingdom", "Russia"]]
toppp.T.plot()

* Bottom 10 countries

In [None]:
## Find the 10 countries with the fewest confirmed cases
small = grouped.iloc[:,-1]
smalll = small.nsmallest(10)
display(smalll)

In [None]:
## Plot the 10 countries with the fewest confirmed cases (taken from the previous cell)
smallll = grouped.loc[smalll.index]
smallll.T.plot()

## Dataset for population

We obtained an open dataset from datahub.io containing population figures for various countries. This dataset covers the overall population of each country and spans from 1960 to 2016. In our COVID dataset, we have data for 186 countries, while the population dataset includes information on 263 countries. To ensure consistency, we manually adjusted the country names in our population dataset to match those in our COVID dataset. For instance, 'United States' was changed to 'US' so that the two datasets can be merged on the country name.
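
The renaming described above was done by hand in the CSV file. As a rough sketch of how the same step could be done and checked in pandas, the example below uses a hypothetical mapping. Only the 'United States' to 'US' pair comes from the text; the other spellings and the rounded population figures are illustrative assumptions.

```python
import pandas as pd

# Hypothetical labels; only the 'United States' -> 'US' pair comes from the text.
covid_names = pd.Index(["US", "Australia", "Korea, South"])
population_toy = pd.DataFrame({
    "Country": ["United States", "Australia", "Korea, Rep."],
    "Year_2016": [323_000_000, 24_000_000, 51_000_000],  # rounded, illustrative
})

# Mapping from population-dataset names to COVID-dataset names (assumed spellings).
name_map = {"United States": "US", "Korea, Rep.": "Korea, South"}
population_toy["Country"] = population_toy["Country"].replace(name_map)

# Names present in the COVID data but still missing from the population data.
unmatched = sorted(set(covid_names) - set(population_toy["Country"]))
print(unmatched)
```

An empty `unmatched` list means every country in the COVID data has a population row, so an inner merge would not silently drop it.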

In [None]:
## Read population dataset
population = pd.read_csv("population.csv", encoding = "ISO-8859-1")
population.head()

In [None]:
## Get Country name and population in 2016 only
pop = population[['Country','Year_2016']]

## Change column name to merge with our covid dataset
popp = pop.rename(columns={"Country" : "Country/Region"})
poppp = popp.groupby('Country/Region').sum()

## Merge population and covid dataset
npp = pd.merge(poppp,grouped, how = 'inner', left_index=True, right_index=True)
npp.head()

In [None]:
## Calculate confirmed cases per 10 people
npp = npp.iloc[:,-1]/npp['Year_2016'] * 10
npp

* Top 10 countries

In [None]:
## Find top 10 countries and plot
nptop = npp.nlargest(10)
plt.barh(nptop.index,nptop)

* Bottom 10 countries

In [None]:
## Find the 10 countries with the fewest cases per 10 people and plot
npsmall = npp.nsmallest(10)
plt.barh(npsmall.index,npsmall)


* San Marino has about 7 confirmed cases for every 10 people, relative to its 2016 population. While the US has the highest number of confirmed cases overall, San Marino leads in cases per capita. This comparison alone cannot tell us whether the US or San Marino has been more severely affected by COVID-19, but it provides a quick insight into the statistics.

# Predictions


### Data pre-processing
The live dataset reports cumulative counts, so we first convert them to daily new cases for South Korea, the series we want to predict. We then split the daily cases into a training set and a test set for validation. The dataset spans 1,146 days, with about 90% of the data allocated to the training set: we train the model on data from January 22, 2020, to December 31, 2022, and validate it by comparing actual and predicted values from January 1, 2023, to March 9, 2023.

In [None]:
## Extract the data for Korea
grouped = grouped.loc[["Korea, South"]]
total_cases = grouped.sum(axis=0)
total_cases.index = pd.to_datetime(total_cases.index)
total_cases

In [None]:
## total number of cases in Korea since 2020
ax = plt.gca()
plt.plot(total_cases)
ax.xaxis.set_major_locator(mdates.MonthLocator(interval=7))
plt.title("Total cases in Korea");

* To calculate the daily new cases, we subtract the cumulative number of cases on the previous date from the number on the current date

In [None]:
## The first day has no previous value, so it keeps its own cumulative count
daily_cases = total_cases.diff().fillna(total_cases.iloc[0]).astype(np.int64)
daily_cases
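
The differencing step above can be checked on a toy series. This is a minimal sketch with made-up cumulative counts, not values from the dataset.

```python
import pandas as pd

# Made-up cumulative counts for five consecutive days.
cumulative = pd.Series(
    [1, 1, 4, 9, 9],
    index=pd.date_range("2020-01-22", periods=5, freq="D"),
)

# Daily new cases: today's total minus yesterday's total.
# The first day has no predecessor, so it keeps its own cumulative value.
daily = cumulative.diff().fillna(cumulative.iloc[0]).astype("int64")
print(daily.tolist())  # [1, 0, 3, 5, 0]

# Summing the daily values recovers the cumulative series.
recovered = daily.cumsum()
print((recovered == cumulative).all())  # True
```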

In [None]:
## plot daily cases
ax = plt.gca()
plt.plot(daily_cases)
ax.xaxis.set_major_locator(mdates.MonthLocator(interval=7))
plt.title("Daily new cases in Korea");

* This plot could provide insight into the peak of COVID-19 spread in Korea.

In [None]:
## split data into train and test set
daily_cases.index = pd.to_datetime(daily_cases.index)
training_set = daily_cases[:'2022']
test_set = daily_cases['2023':]

## ARIMA Model

* Test for Stationarity before building ARIMA

In [None]:
## Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF)
plot_acf(daily_cases)
plot_pacf(daily_cases)
plt.show()

In [None]:
## Augmented Dickey-Fuller test
result = adfuller(daily_cases)

print('ADF statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical values:')
for key, value in result[4].items():
  print('\t%s: %.3f' % (key,value))

* The ACF and PACF plots and the ADF test all indicate that our series is stationary: the ACF values drop off quickly, the PACF values approach 0, and the p-value from the ADF test is below 0.05, so the null hypothesis of a unit root is rejected.
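
To make the "ACF values drop off quickly" criterion concrete, here is a rough sketch that computes the sample autocorrelation by hand for two synthetic series. The helper `sample_acf` is a hypothetical name introduced for illustration; it is not part of statsmodels, and the synthetic series are not related to the COVID data.

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelation of a 1-D series for lags 0..max_lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array(
        [np.dot(x[: len(x) - k], x[k:]) / denom for k in range(max_lag + 1)]
    )

rng = np.random.default_rng(42)
noise = rng.normal(size=1000)   # stationary white noise
walk = np.cumsum(noise)         # random walk, non-stationary

acf_noise = sample_acf(noise, 10)
acf_walk = sample_acf(walk, 10)

print("white noise, lags 1-3:", np.round(acf_noise[1:4], 2))
print("random walk, lags 1-3:", np.round(acf_walk[1:4], 2))
```

For the white noise the autocorrelation is close to zero from lag 1 onwards, while for the random walk it stays close to 1 and decays slowly, which is the pattern usually read as non-stationarity.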

In [None]:
## Find the optimal p (AR order), d (order of differencing) and q (MA order)
model = auto_arima(daily_cases, start_p=1, start_q=1,
                      test='adf',
                      max_p=5, max_q=5,
                      m=1,
                      d=1,
                      seasonal=False,
                      start_P=0,
                      D=None,
                      trace=True,
                      error_action='ignore',
                      suppress_warnings=True,
                      stepwise=True)

* Fit model


In [None]:
## Fit a model with p (AR order) = 5, d (order of differencing) = 1, q (MA order) = 4, based on the best model selection
## The model is fitted on the training set only, so the 2023 forecast is out of sample
model = ARIMA(training_set, order=(5,1,4))
model_fit = model.fit()

## Set up the date period we want to forecast
start_index = pd.Timestamp(2023, 1, 1)
end_index = pd.Timestamp(2023, 3, 9)
## This ARIMA class predicts on the original scale, so typ='levels' is not needed
forecast = model_fit.predict(start=start_index, end=end_index)

In [None]:
forecast.head(15)

In [None]:
test_set.head(15)

In [None]:
##plot the original data vs predicted data
plt.figure(figsize=(22,8))
plt.plot(test_set, label="original")
plt.plot(forecast,label = "predicted")
plt.title("Covid Forecast")
plt.xlabel("Date")
plt.ylabel("Daily new cases")
plt.legend()
plt.show()
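
Besides the plot, the comparison between actual and predicted values can be summarised with error metrics. The sketch below uses made-up numbers on a shared date index. MAE and RMSE are one common choice and are not used in the original analysis.

```python
import numpy as np
import pandas as pd

# Made-up actual and predicted daily counts on a shared date index.
dates = pd.date_range("2023-01-01", periods=4, freq="D")
actual = pd.Series([100, 120, 90, 110], index=dates)
predicted = pd.Series([110, 115, 100, 100], index=dates)

errors = actual - predicted             # aligned on the date index
mae = errors.abs().mean()               # mean absolute error
rmse = np.sqrt((errors ** 2).mean())    # root mean squared error
print(f"MAE={mae:.2f}, RMSE={rmse:.2f}")
```

With the notebook's own variables, `actual` and `predicted` would correspond to `test_set` and `forecast`, assuming both cover the same dates.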

## LSTM (Long Short-Term Memory), Deep Learning

In [None]:

scaler = MinMaxScaler()

scaler = scaler.fit(np.expand_dims(training_set, axis=1))

train_data = scaler.transform(np.expand_dims(training_set, axis=1))

test_data = scaler.transform(np.expand_dims(test_set, axis=1))
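
The scaling cell above fits the scaler on the training set only and then applies it to both sets. As a sketch of what this does, the example below reproduces min-max scaling by hand in NumPy, assuming the default target range of 0 to 1. The values and the helper names `scale` and `unscale` are made up for illustration.

```python
import numpy as np

train = np.array([10.0, 20.0, 40.0])   # made-up training values
test = np.array([30.0, 50.0])          # made-up test values

# "Fit": take the minimum and maximum from the training data only.
lo, hi = train.min(), train.max()

def scale(x):
    """Map values to [0, 1] relative to the training range."""
    return (x - lo) / (hi - lo)

def unscale(z):
    """Invert the scaling to get back to the original units."""
    return z * (hi - lo) + lo

train_scaled = scale(train)
test_scaled = scale(test)
print(train_scaled, test_scaled)
```

Because the range comes from the training data, test values above the training maximum are mapped to values greater than 1. That is expected, and it avoids leaking information from the test period into the model.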

In [None]:
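def create_sequences(data, seq_length):
    """Build sliding windows: each x holds seq_length consecutive values, y is the value that follows."""
    values = np.asarray(data)  # accepts a pandas Series or a NumPy array
    xs = []
    ys = []
    for i in range(len(values)-seq_length):
        x = values[i:(i+seq_length)]
        y = values[i+seq_length]
        xs.append(x)
        ys.append(y)
    return np.array(xs), np.array(ys)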

In [None]:
seq_length = 5
X, y = create_sequences(daily_cases, seq_length)
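
To show what the sliding windows look like, here is a small standalone sketch on a toy sequence. `make_windows` is a hypothetical helper that mirrors the logic of `create_sequences` so the example runs on its own.

```python
import numpy as np

def make_windows(values, seq_length):
    """Split a 1-D sequence into (window, next value) pairs."""
    values = np.asarray(values)
    n = len(values) - seq_length
    xs = [values[i : i + seq_length] for i in range(n)]
    ys = [values[i + seq_length] for i in range(n)]
    return np.array(xs), np.array(ys)

toy = np.arange(8)                     # 0, 1, ..., 7
X_toy, y_toy = make_windows(toy, seq_length=5)

print(X_toy.shape, y_toy.shape)        # (3, 5) (3,)
print(X_toy[0], "->", y_toy[0])        # [0 1 2 3 4] -> 5
```

A sequence of length 8 with a window of 5 yields 3 samples: each input row contains five consecutive values and the target is the value that comes right after them.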