### 1. Dataset Representation

- About the Dataset<br>

The data was provided by Our World in Data (OWID). The file contains a range of values that help describe each country’s COVID-19 status. This project uses the July 15, 2021 release of the dataset. OWID updates the data daily or weekly whenever possible, so later releases will contain more recent figures.

- Collection Process and its Implications<br>

The data was collected by Our World in Data, a research group that aggregates data into a single accessible repository in order to give a clearer picture of, and help solve, world problems. For this dataset, they drew on the data that national governments have released publicly. According to OWID, the sources include:
    
    1. COVID-19 Data Repository of Johns Hopkins University
    2. National Government Reports
    3. Oxford COVID-19 Government Response Tracker, Blavatnik School of Government
    4. United Nations Data (for demographics related data)
    5. World Bank Data (for demographics related data)
    
This means the dataset is only as current and as accurate as its sources: its validity ultimately depends on how transparently and accurately each government reports its data, both publicly and to Johns Hopkins University.
    <br>
- Structure of the Dataset File<br>

    The dataset consists of 102,475 observations and 60 variables. Each country's records begin on the date it reported either its first COVID-19 case or its first COVID-19 test. The dataset is distributed publicly as a single file containing all of the relevant information. More specialized versions of the dataset are also available in OWID's GitHub repository.
    
    The locations are a mixture of continents and countries as recognized by OWID, which may or may not be legally recognized by the international community.
    
    <br>
- About the Variables<br>
    
    The dataset has 60 variables. Most relate to COVID-19 figures such as cases, deaths, recoveries, and vaccinations, among others; the rest cover demographic data such as GDP per capita, HDI, median age, population, and population density.
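
Because the location list mixes aggregates with countries, it can help to split the two before analysis. The sketch below runs on a small made-up frame and assumes, as the non-null counts in the `raw_df.info()` output suggest, that aggregate rows such as `OWID_WRL` carry no `continent` value. The `sample_df` rows and the `split_aggregates` helper are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical rows shaped like the OWID file (no real figures involved).
sample_df = pd.DataFrame({
    "iso_code": ["PHL", "SGP", "OWID_WRL", "OWID_ASI"],
    "continent": ["Asia", "Asia", np.nan, np.nan],
    "location": ["Philippines", "Singapore", "World", "Asia"],
})

def split_aggregates(df):
    """Return (countries, aggregates), assuming aggregates have no continent."""
    is_aggregate = df["continent"].isna()
    return df[~is_aggregate], df[is_aggregate]

countries, aggregates = split_aggregates(sample_df)
print("Countries:", countries["location"].tolist())
print("Aggregates:", aggregates["location"].tolist())
```

If the assumption does not hold for a given release, the list of `iso_code` values would need to be checked by hand instead.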

In [1]:
print("LOADING LIBRARIES...")
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import norm
from scipy.stats import ttest_ind

bar = "================================="
automated = True #Manual entry or pre-defined entries
print("AUTOMATED MODE:",automated)

#Code for data preparation
#PREPARE FILES AND RAW DATAFRAME
raw_df = None
if(not automated):
    filename = input("Enter Filename of CSV file (including .csv): ")
    raw_df = pd.read_csv(filename)
else:
    raw_df = pd.read_csv("COVID_7_15.csv")
#Raw file reading: make use of covid_df.readline() to retrieve a str line (as str) from

print("Raw Dataframe Shape:", raw_df.shape,"\n",bar)
print(raw_df.info())
print("Location List:",raw_df["location"].unique())

LOADING LIBRARIES...
AUTOMATED MODE: True
Raw Dataframe Shape: (102475, 60) 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 102475 entries, 0 to 102474
Data columns (total 60 columns):
 #   Column                                 Non-Null Count   Dtype  
---  ------                                 --------------   -----  
 0   iso_code                               102475 non-null  object 
 1   continent                              97689 non-null   object 
 2   location                               102475 non-null  object 
 3   date                                   102475 non-null  object 
 4   total_cases                            98594 non-null   float64
 5   new_cases                              98591 non-null   float64
 6   new_cases_smoothed                     97581 non-null   float64
 7   total_deaths                           88371 non-null   float64
 8   new_deaths                             88527 non-null   float64
 9   new_deaths_smoothed                    97581 non

### 2. Data Cleaning

Given the large number of nations and variables, the scope was reduced to the ASEAN nations, with the World as a baseline. The ASEAN nations were chosen for the following reasons:

1. Near proximity
2. Economic integration
3. Economies and populations of similar scale

This could help us compare the COVID-19 status of the Philippines with that of its neighbors and, where applicable, with that of the World.

Most of the columns are omitted because they contain pre-treated values, specialized values, or values whose units of measurement vary.

Some of the columns retained are:
- 'total_cases'
- 'new_cases'
- 'total_deaths'
- 'new_deaths'
- 'total_vaccinations'
- 'people_vaccinated'
- 'people_fully_vaccinated'
- 'new_vaccinations'
- 'stringency_index'
- 'population'
- 'gdp_per_capita'

The script below crunches the raw data and produces a covid_df containing:
1. World COVID-19 Data (from OWID)
2. ASEAN COVID-19 Data (Containing 10 Countries, including the Philippines)
3. Philippine COVID-19 Data

In [2]:
#CSMODEL: COVID-19 Dataset
#Crunches data of selected countries to a grouped one
#Originally came from a separate 

print("LOADING ADDITIONAL LIBRARIES...")
import re

#GLOBAL VARIABLES
checkpoint = True
NaN = float("nan")
group_pop = 0 #Placeholder for the population of group of nations specified.

#CUSTOM FUNCTIONS
def sortbydate(df): #Returns the unique 'date' values of a given DataFrame, sorted using MergeSort.
    date_values = df['date'].unique()
    date_values = np.sort(date_values,kind='mergesort')
    return date_values
def fillZeros(size): #Returns a list of zeros from a specified size
    return np.zeros(size).tolist()
def writeCheckpoint(df, filename): #Writes a given DataFrame to a CSV file
    if(checkpoint):
        print("WRITING CHECKPOINT...")
        df.to_csv(filename+".csv",index=False)
        print("Checkpoint Complete:",filename)
def aggregator(src_df,iso_code,continent,location,count): #Aggregates the given DataFrame to a grouped version
    tmp_df = pd.DataFrame(columns=toRetain) 
    for i in range(dateCount):
        sp_date = date_values[i] #Specified date
        filtered_df = src_df[src_df['date']==sp_date] #Series of nations with specified date
        observations = filtered_df.shape[0]
        if(observations == count): #Will run only if all countries listed are there
            id = [iso_code,continent,location,sp_date] #Default identifiers for ASEAN
            data = fillZeros(len(toRetainData))
            for j in range(observations):
                #add current data with the retrieved data
                retrieve = filtered_df[toRetainData].iloc[j].tolist()
                #print(retrieve)
                data = list(map(lambda x,y:x+y,retrieve,data))
            #Make values in average if they are based on trends (Keyword: new, per_xxxx)
            #0-3 = iso_code,continent,location,date; equated to id
            data[1] = data[1]/observations #new cases
            data[3] = data[3]/observations #new deaths
            data[7] = data[7]/observations #new_vaccinations (index 7 in toRetainData; index 6 is people_fully_vaccinated)
            data[8] = data[8]/observations #stringency_index
            data[9] = group_pop #population
            data[10] = data[10]/observations #gdp_per_capita
            result = id+data
            tmp_df.loc[tmp_df.shape[0]] = result #"ADDS" THE RESULTING LIST AT THE END OF THE DATAFRAME
    return tmp_df
def dateRange(df): #Finds the lowest and highest date recorded.
    date_values = df['date'].unique()
    date_values = np.sort(date_values,kind='mergesort')
    return [date_values[0], date_values[-1]] #[earliest, latest]; note that the data for the latest date may be incomplete

#PREPARE FILES AND RAW DATAFRAME
covid_df = raw_df.copy(deep=True)
#Raw file reading: make use of covid_df.readline() to retrieve a str line (as str) from

#DATE SORTING AND VALUES
date_values = sortbydate(covid_df)
dateCount = date_values.size

#COLUMNS TO RETAIN
toRetain = ['iso_code','continent','location','date','total_cases','new_cases','total_deaths','new_deaths','total_vaccinations','people_vaccinated','people_fully_vaccinated','new_vaccinations','stringency_index',
            'population','gdp_per_capita']
toRetainData = toRetain[4:]
identifiers = toRetain[0:4]
#LIST OF ONLY DATA THAT CAN BE USED IN A COLLECTIVE MANNER (AS USED BY OWID ITSELF)
forCollective = ['total_cases','new_cases','new_cases_smoothed','total_deaths','new_deaths','new_deaths_smoothed','total_cases_per_million'
                ,'new_cases_per_million','new_cases_smoothed_per_million','total_deaths_per_million','new_deaths_per_million','new_deaths_smoothed_per_million','total_vaccinations'
                ,'people_vaccinated','people_fully_vaccinated','new_vaccinations','new_vaccinations_smoothed','total_vaccinations_per_hundred','people_vaccinated_per_hundred'
                ,'people_fully_vaccinated_per_hundred','new_vaccinations_smoothed_per_million','population']
targetCountries = ['PHL','BRN','KHM','IDN','SGP','LAO','THA','MYS','MMR','VNM'] #CHANGE CHOICES FOR TARGET COUNTRIES TO GROUP

#COLUMN FLAGS
raw_cols = covid_df.columns.tolist() #ALL COLUMNS AVAILABLE
raw_dataCol = list(set(raw_cols)-set(identifiers)) #DATA ONLY COLUMNS

#DROP COLUMNS
print("DROPPING COLUMNS...")
toDrop = list(set(covid_df.columns.tolist()) - set(toRetain)) #every column not listed in toRetain
covid_df = covid_df.drop(columns=toDrop)

#FILTERING COUNTRIES
print("FILTERING COUNTRIES...")
ph_df = covid_df[covid_df['iso_code']=='PHL'] #PH ONLY
world_df = covid_df[covid_df['iso_code'].str.contains('OWID_WRL')] #OVERALL WORLD DATA BY OWID
covid_df = covid_df[covid_df['iso_code'].str.contains(re.compile('|'.join(targetCountries)),regex=True)] #ASEAN NATIONS; YOU CAN CHANGE LIST OF COUNTRIES TO FOCUS

#FIND TOTAL POPULATION OF ASEAN
pop = covid_df[covid_df['date']==dateRange(covid_df)[1]]
if(pop.shape[0] != len(targetCountries)): #REFERENCES TO targetCountries
    print("COUNTRIES!=",len(targetCountries),"AT MAX DATE!")
    exit()
group_pop = pop['population'].sum()

#DATA CLEANUP: NaN->0
print("DATA CLEANUP (NaN->0)...")
for i in range(0,len(toRetain),1):
    covid_df.loc[covid_df[toRetain[i]].isnull(),toRetain[i]]=0

#READING CONTENTS OF EACH OBSERVATION AVAILABLE OF ALL COUNTRIES AVAILABLE ON A GIVEN DATE 
#NOT THE MOST EFFICIENT ALGO AS IT RUNS AT O(n*m)
#WILL MAKE USE OF THE CURRENT LIST OF COUNTRIES AVAILABLE AT covid_df.
print("AGGREGATING ASEAN COUNTRIES...")
group_df = aggregator(covid_df,"MDL_SEA",NaN,"Asia",len(targetCountries)) #Will hold the resulting aggregation of ASEAN countries

#ASEAN Checkpoint
writeCheckpoint(group_df,"asean_checkpoint")

#COMBINING ALL SUBDATAFRAMES TO covid_df
print("COMBINING DATAFRAMES...")
covid_df = pd.concat([group_df, covid_df, world_df])
sortbydate(covid_df) #NOTE: only returns the sorted unique dates; the rows of covid_df are not re-sorted here

print("Remaining iso_codes:",covid_df['iso_code'].unique())
print("FILE PROCESSING COMPLETE")

LOADING ADDITIONAL LIBRARIES...
DROPPING COLUMNS...
FILTERING COUNTRIES...
DATA CLEANUP (NaN->0)...
AGGREGATING ASEAN COUNTRIES...
WRITING CHECKPOINT...
Checkpoint Complete: asean_checkpoint
COMBINING DATAFRAMES...
Remaining iso_codes: ['MDL_SEA' 'BRN' 'KHM' 'IDN' 'LAO' 'MYS' 'MMR' 'PHL' 'SGP' 'THA' 'VNM'
 'OWID_WRL']
FILE PROCESSING COMPLETE


**The following dataframes can be used in the subsequent code:**
- covid_df = Combined group_df, the rows of the ASEAN member nations (including the Philippines), and world_df
- group_df = Aggregated ASEAN data (iso_code MDL_SEA)
- world_df = Overall World COVID data
- ph_df = Philippine COVID data
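
The per-date aggregation loop can also be expressed with `groupby`. The sketch below is an alternative on made-up numbers, not the notebook's method: it sums cumulative columns, averages the `new_*` columns, and keeps only the dates on which every country reported, mirroring the `observations == count` check. The `aggregate_group` helper and the toy frame are hypothetical.

```python
import pandas as pd

# Made-up rows for two hypothetical countries; "BBB" has no row on the first date.
toy_df = pd.DataFrame({
    "iso_code": ["AAA", "AAA", "BBB"],
    "date": ["2021-07-14", "2021-07-15", "2021-07-15"],
    "total_cases": [100.0, 110.0, 50.0],
    "new_cases": [5.0, 10.0, 4.0],
})

def aggregate_group(df, sum_cols, mean_cols, expected_count):
    """Aggregate per date, keeping only dates that have expected_count rows."""
    grouped = df.groupby("date")
    result = pd.concat(
        [grouped[sum_cols].sum(), grouped[mean_cols].mean()], axis=1
    )
    complete = grouped.size() == expected_count  # one boolean per date
    return result[complete].reset_index()

agg_df = aggregate_group(toy_df, ["total_cases"], ["new_cases"], expected_count=2)
print(agg_df)
```

Note that `mean()` skips missing values, so they would need to be filled with 0 first to reproduce the NaN->0 cleanup used before the loop.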

### 3. Exploratory Data Analysis

EDA Questions<br>
1. Do case trends increase or decrease by month in each of the listed countries?
2. Is there a correlation between a country's GDP per capita and its number of hospital and ICU patients?
3. Do case numbers correlate negatively with the number of people being vaccinated?
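
The first question could be approached by totaling new cases per country per month. This is a minimal sketch on made-up numbers, assuming `date` holds ISO-formatted strings; `monthly_new_cases` is a hypothetical helper.

```python
import pandas as pd

# Made-up daily rows for two hypothetical locations.
toy_df = pd.DataFrame({
    "location": ["A", "A", "A", "B", "B"],
    "date": ["2021-05-30", "2021-06-01", "2021-06-15", "2021-05-31", "2021-06-02"],
    "new_cases": [10.0, 20.0, 30.0, 1.0, 2.0],
})

def monthly_new_cases(df):
    """Sum new_cases per location and calendar month."""
    # Convert each date to a 'YYYY-MM' label so rows can be grouped by month.
    month = pd.to_datetime(df["date"]).dt.to_period("M").astype(str)
    return (
        df.assign(month=month)
          .groupby(["location", "month"], as_index=False)["new_cases"]
          .sum()
    )

monthly = monthly_new_cases(toy_df)
print(monthly)
```

Comparing consecutive months within each location then shows whether the monthly totals rose or fell.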

**Numerical Summaries**

In [3]:
#Code for creating days by months
def createMonthList(start_year, month_count):
    month_list = [] #first day of each month as 'YYYY-MM-01' strings
    YM = [start_year,0] #[YEAR,MONTH]
    for i in range(month_count):
        if(YM[1]==12): #roll over to January of the next year
            YM[0] = YM[0] + 1
            YM[1] = 1
        else:
            YM[1] = YM[1] + 1
        val = str(YM[0])+'-'+str(YM[1]).zfill(2)+'-'+'01' #zero-padded month so the strings sort chronologically
        month_list.append(val)
    return month_list

byMonth_df = 

SyntaxError: invalid syntax (<ipython-input-3-4a15e9b0e3c6>, line 15)
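
For comparison, a list of month-start dates can also be built with `pd.date_range`. This is a sketch of an alternative, not the cell's intended code; `month_starts` is a hypothetical name, and the start year and month count are arbitrary.

```python
import pandas as pd

def month_starts(start_year, month_count):
    """Return month_count first-of-month dates as 'YYYY-MM-DD' strings."""
    # freq="MS" steps from one month start to the next.
    dates = pd.date_range(start=f"{start_year}-01-01", periods=month_count, freq="MS")
    return dates.strftime("%Y-%m-%d").tolist()

month_list = month_starts(2020, 14)
print(month_list)
```
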

In [None]:
#Code for numerical summaries

**Visualizations**

In [None]:
#Code for visualizations

### 4. Research Question

1. Is there a significant difference between ASEAN member nations in total and new case numbers?<br><br>
    1. Scope in Dataset: Total cases and/or New cases
    2. Significance: This shows how the Philippines fares against COVID-19 in comparison with its ASEAN neighbors as well as with the world.

2. Is the government meeting its half-way goal of vaccinating a significant number of people?<br><br>

### 5. Statistical Inference

**Hypothesis**<br><br>

$H_0=$ 
<br>
$H_A=$ 
<br>
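
As one possible shape for the test, the sketch below runs Welch's two-sample t-test with `scipy.stats.ttest_ind`, which the notebook already imports. The two arrays are made-up daily new-case counts for two hypothetical countries, and the 0.05 threshold is an assumed significance level; none of this reflects actual results.

```python
import numpy as np
from scipy.stats import ttest_ind

# Made-up daily new-case counts for two hypothetical countries.
country_a = np.array([120.0, 135.0, 128.0, 140.0, 132.0, 138.0])
country_b = np.array([80.0, 95.0, 90.0, 85.0, 99.0, 92.0])

alpha = 0.05  # assumed significance level

# equal_var=False selects Welch's t-test, which does not assume equal variances.
t_stat, p_value = ttest_ind(country_a, country_b, equal_var=False)

print("t statistic:", t_stat)
print("p-value:", p_value)
print("Reject H0 at alpha:", p_value < alpha)
```

Daily counts are serially correlated, so the independence assumption behind the t-test holds only approximately for this kind of data.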

In [None]:
#Code for formulating statistical inference and hypothesis testing

### 6. Insights and Conclusions

{CONTENT}