# Basic Info:
The Project Title is: The Epidemiology of COVID-19

Group member 1 
name: Austin Hickey
e-mail: U1041943@utah.edu
UID: U1041943

Group member 2
name: Spencer Sawas
e-mail: spencer.sawas@utah.edu
UID: U1065866

Group member 3
name: Marko Miholjcic
e-mail: u0984549@utah.edu
UID: u0984549

# Background and Motivation


The world is currently facing a global pandemic caused by a new virus that has evolved and crossed over to humans. The virus, which causes the disease known as COVID-19, affects the respiratory and cardiovascular systems by binding to ACE2 receptors, which are found throughout both systems. For this reason, the virus is extremely dangerous for susceptible populations. Currently, 30% of Americans suffer from cardiovascular diseases, 10% from diabetes, and 10% from asthma; these figures do not include those who are immunocompromised or more susceptible for other reasons. 
The virus causes violent coughing, restricted breathing, inflammation, and cardiovascular hypertension (among other cardiovascular effects). Furthermore, it can cause pneumonia, which can be fatal, especially if left untreated. 
Because the virus is so transmissible, hospitals and healthcare workers across the world have been put to the test. The rapid spread of the virus has left many hospitals overloaded with patients and short on resources. 

To try to curb the spread of the virus, countries across the world are temporarily shutting down, and government officials have recommended social quarantine. The ripple effects have been detrimental to the economy. Millions of people have lost their jobs. Some people are dipping into their savings to pay rent, while others are unable to pay rent at all. Some economists have predicted an economic recession after the virus passes.  

# Project Objectives

Our objective is to understand the severity of the global pandemic and to predict its near-term effects for states in the United States and countries around the world. Using predictive modeling techniques such as logistic regression, we will estimate how long it will take for the number of cases to plateau, how many cases there will be at that point, and how many deaths a location will experience. 
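One possible way to estimate a plateau is to fit a logistic growth curve to cumulative case counts. The sketch below assumes that this is the kind of model intended (it is a curve fit, not a logistic regression classifier), and it uses synthetic data; the function name `logistic`, the parameter names, and the 95% threshold are illustrative choices, not part of the project.

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(t, K, r, t0):
    """Logistic growth curve: K is the plateau, r the growth rate, t0 the inflection day."""
    return K / (1.0 + np.exp(-r * (t - t0)))

# Synthetic cumulative case counts (illustrative only, not real data)
days = np.arange(60)
true_K, true_r, true_t0 = 10000.0, 0.2, 30.0
rng = np.random.default_rng(0)
cases = logistic(days, true_K, true_r, true_t0) + rng.normal(0, 50, size=days.size)

# Initial guesses: plateau near the largest observed value, inflection mid-series
p0 = [cases.max(), 0.1, days.mean()]
(K_hat, r_hat, t0_hat), _ = curve_fit(logistic, days, cases, p0=p0, maxfev=10000)

# Day on which the fitted curve reaches 95% of its plateau:
# K / (1 + exp(-r (t - t0))) = 0.95 K  =>  t = t0 + ln(0.95 / 0.05) / r
plateau_day = t0_hat + np.log(0.95 / 0.05) / r_hat
print(f"estimated plateau: {K_hat:.0f} cases, 95% reached around day {plateau_day:.1f}")
```

On real data the fit is only as reliable as the portion of the curve observed so far; before the inflection point the plateau estimate can be very unstable.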

We will use the predictive models to create plots that visualize the severity of the virus in the locations analyzed. Further, we will explore which states will be most heavily impacted. A geospatial map will be created to plot the spread of COVID-19 in the United States and to show the number of hospital beds per 1,000 people for each state.


These variables will provide an opportunity to perform a clustering analysis to determine whether there are any relationships between the variables and the number of cases in a state. 
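As a rough illustration of what such a clustering analysis could look like, the sketch below groups states with hierarchical (Ward) clustering from SciPy. The state labels, the feature values, and the choice of two clusters are all made up for the example; the actual method and variables may differ.

```python
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical state-level features (made-up values for illustration only)
states = pd.DataFrame(
    {
        "density": [10.0, 12.0, 11.0, 400.0, 420.0, 410.0],
        "beds_per_1000": [3.5, 3.4, 3.6, 2.0, 2.1, 1.9],
    },
    index=["A", "B", "C", "D", "E", "F"],
)

# Standardize each column so that no variable dominates because of its units
z = (states - states.mean()) / states.std()

# Build the Ward linkage tree, then cut it into two clusters
tree = linkage(z.to_numpy(), method="ward")
states["cluster"] = fcluster(tree, t=2, criterion="maxclust")
print(states)
```

The cluster labels can then be compared with case counts per state to see whether similar states experience similar outbreaks.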

# Data


We will be using several sources of data in order to accurately represent the COVID-19 outbreaks and information relevant to analyzing contributing factors to the outbreak.

For the COVID-19 data, we will use multiple raw-file URLs of a GitHub repository to collect the number of cases, recoveries, and deaths over time for a number of countries. 
This data is compiled in CSV format by the Johns Hopkins Center for Systems Science and Engineering. The data extracted will cover January 22, 2020 through April 1, 2020. The GitHub repository is:
https://github.com/CSSEGISandData/COVID-19.

Below, a single URL is retrieved and loaded into a dataframe to display the data that is imported from the API. Each day, the GitHub repository publishes the cumulative number of cases, recoveries, and deaths. The data also provides the province/state and country/region where the cases are occurring. 

In [1]:
import requests

url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv'
response = requests.get(url)
response

<Response [200]>

In [2]:
import pandas as pd

In [3]:
df = pd.read_csv(url)
df

Unnamed: 0,Province/State,Country/Region,Last Update,Confirmed,Deaths,Recovered
0,Anhui,Mainland China,1/22/2020 17:00,1.0,,
1,Beijing,Mainland China,1/22/2020 17:00,14.0,,
2,Chongqing,Mainland China,1/22/2020 17:00,6.0,,
3,Fujian,Mainland China,1/22/2020 17:00,1.0,,
4,Gansu,Mainland China,1/22/2020 17:00,,,
5,Guangdong,Mainland China,1/22/2020 17:00,26.0,,
6,Guangxi,Mainland China,1/22/2020 17:00,2.0,,
7,Guizhou,Mainland China,1/22/2020 17:00,1.0,,
8,Hainan,Mainland China,1/22/2020 17:00,4.0,,
9,Hebei,Mainland China,1/22/2020 17:00,1.0,,
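The same pattern can be extended to the whole date range. The sketch below builds one URL per day from January 22, 2020 to April 1, 2020, following the `MM-DD-YYYY.csv` file naming seen in the URL above. The helper `load_reports` and the `source` column are hypothetical names, and the helper is defined but not called here because it needs network access.

```python
import pandas as pd

BASE = ("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/"
        "csse_covid_19_data/csse_covid_19_daily_reports/")

# One report per day; file names follow the MM-DD-YYYY.csv pattern
dates = pd.date_range("2020-01-22", "2020-04-01", freq="D")
urls = [BASE + d.strftime("%m-%d-%Y") + ".csv" for d in dates]

print(len(urls))  # 71
print(urls[0])

def load_reports(urls):
    """Download every daily report and stack them, tagging each row with its source file."""
    frames = [pd.read_csv(u).assign(source=u.rsplit("/", 1)[-1]) for u in urls]
    return pd.concat(frames, ignore_index=True)
```

If the column layout differs between files, `pd.concat` keeps the union of the columns and fills the gaps with `NaN`, so column names would need to be harmonized before analysis.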


For our regression model, which will predict the number of deaths from COVID-19 that a particular country will experience, we will extract data from the WHO healthcare index PDF. A country's healthcare index is a good indicator of how capable its healthcare infrastructure is and of how healthy the country is overall. The WHO healthcare index takes a myriad of variables into account to rank every nation's healthcare system.
https://www.who.int/healthinfo/paper30.pdf

In [4]:
# Requires the third-party pdfplumber package (pip install pdfplumber)
import pdfplumber
import pandas as pd

COLUMNS = ['Rank', 'Member State', 'Index']

# The ranking table spans four consecutive pages of the PDF (zero-based indices 17-20)
with pdfplumber.open(r"C:\Users\austi\Documents\GitHub\COMP5360Project\project_files\pdf_resources\world_health_index.pdf") as pdf:
    tables = [pdf.pages[i].extract_table() for i in range(17, 21)]

# Row 1 of each extracted table is used as its header row. On the first page that
# row holds the real column names; on the later pages it is the first data row, so
# the three wanted columns are selected by that row's values and then renamed.
header_cells = [
    COLUMNS,
    ['55', 'Albania', '0.774'],
    ['117', 'Uzbekistan', '0.599'],
    ['178', 'Chad', '0.303'],
]

frames = []
for table, cells in zip(tables, header_cells):
    frame = pd.DataFrame(table[1:], columns=table[1])[cells]
    frames.append(frame.rename(columns=dict(zip(cells, COLUMNS))))

# Stack the four pages into one frame
df = pd.concat(frames, ignore_index=True)

# Strip spaces and embedded newlines left over from the PDF layout
for column in COLUMNS:
    df[column] = df[column].str.replace(" ", "")
    df[column] = df[column].replace('\n', '', regex=True)

df = df.dropna()
df = df.drop([0])  # row 0 repeats the header of the first page
df['Rank'] = df['Rank'].astype(int)
df['Index'] = df['Index'].astype(float)
df['Member State'] = df['Member State'].astype(str)
df.dtypes

ModuleNotFoundError: No module named 'pdfplumber'

In [5]:
with pd.option_context('display.max_rows', None, 'display.max_columns', None):  # more options can be specified also
    print(df)
    
df.to_csv(r'C:\Users\austi\Documents\GitHub\COMP5360Project\project_files\csv_files\CLEAN_WorldHealthIndex.csv', index=False)


The populations and population densities are collected from CSV files downloaded from the following website: http://worldpopulationreview.com/. We gathered two CSV files from the website: one with data for the states and one with data for all of the countries in the world. TODO: the state-level data is still needed.

In [6]:
import pandas as pd

In [7]:
df = pd.read_csv("/Users/markomiholjcic/Documents/GitHub/COMP5360Project/project_files/csv_files/population_and_density_by_country.csv")
df

Unnamed: 0,Rank,name,pop2019,pop2018,GrowthRate,area,Density
0,1,China,1433783.686,,1.0039,9706961.00,147.7068
1,2,India,1366417.754,,1.0099,3287590.00,415.6290
2,3,United States,329064.917,,1.0059,9372610.00,35.1092
3,4,Indonesia,270625.568,,1.0107,1904569.00,142.0928
4,5,Pakistan,216565.318,,1.0200,881912.00,245.5634
5,6,Brazil,211049.527,,1.0072,8515767.00,24.7834
6,7,Nigeria,200963.599,,1.0258,923768.00,217.5477
7,8,Bangladesh,163046.161,,1.0101,147570.00,1104.8734
8,9,Russia,145872.256,,1.0004,17098242.00,8.5314
9,10,Mexico,127575.529,,1.0106,1964375.00,64.9446
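The units of these columns are not stated in the file. A quick consistency check, sketched below on the first three rows of the table, suggests that `pop2019` is expressed in thousands of people: multiplying it by 1,000 and dividing by `area` reproduces the `Density` column. The column name `density_check` is introduced only for this illustration, and the unit of `area` (presumably square kilometers) is an assumption.

```python
import pandas as pd

# First three rows copied from the table above
pop = pd.DataFrame({
    "name": ["China", "India", "United States"],
    "pop2019": [1433783.686, 1366417.754, 329064.917],  # appears to be in thousands
    "area": [9706961.00, 3287590.00, 9372610.00],        # presumably square kilometers
    "Density": [147.7068, 415.6290, 35.1092],
})

# If pop2019 is in thousands, people per unit area should match the Density column
pop["density_check"] = pop["pop2019"] * 1000 / pop["area"]
print(pop[["name", "Density", "density_check"]])
```

Knowing the unit matters when computing per-capita figures such as cases per 100,000 people.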


Lastly, on a national level we will consider additional variables such as state populations/density and the size of the existing healthcare systems for each state (number of hospital beds per 1,000 people). The following website was utilized: https://www.kff.org/other/state-indicator/beds-by-ownership/?currentTimeframe=0&selectedDistributions=statelocal-government&sortModel=%7B%22colId%22:%22Location%22,%22sort%22:%22asc%22%7D

In [1]:
import pandas as pd
import requests

In [2]:
from bs4 import BeautifulSoup

url = 'https://www.kff.org/other/state-indicator/beds-by-ownership/?currentTimeframe=0&selectedDistributions=statelocal-government&print=true&sortModel=%7B%22colId%22:%22State%2FLocal%20Government%22,%22sort%22:%22asc%22%7D'
response = requests.get(url)
response

<Response [200]>

In [3]:
path = "/Users/markomiholjcic/Documents/GitHub/COMP5360Project/project_files/pdf_resources/HospitalBed.txt"

# Save the page source; the with-block closes the file so the text is flushed before it is read back
with open(path, "w") as file:
    file.write(response.text)

with open(path, "r") as file:
    content = file.readlines()
content

['<!DOCTYPE html>\n',
 '<!--[if lt IE 7]> <html lang="en-us" class="lt-ie9 lt-ie8 lt-ie7"> <![endif]-->\n',
 '<!--[if IE 7]>    <html lang="en-us" class="lt-ie9 lt-ie8"> <![endif]-->\n',
 '<!--[if IE 8]>    <html lang="en-us" class="lt-ie9"> <![endif]-->\n',
 '<!--[if gt IE 8]><!--> <html lang="en-US"  class="responsive"> <!--<![endif]-->\n',
 '<head>\n',
 '\t<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1" />\n',
 ...]

Furthermore, we will use data gathered from each county in the United States. This Excel file will provide us with a plethora of variables to explore, such as the number of physicians per county for each state. The Excel document was downloaded from: https://www.countyhealthrankings.org/explore-health-rankings/rankings-data-documentation

# Ethical considerations


If this project were published and publicly visible, we would not want it to induce mass hysteria in the states identified as most severely impacted. It is equally important that our project does not single out a state as least impacted and thereby encourage its residents to disregard the governmental restrictions that were put in place.

Because we are working with an ongoing crisis, we will use the data while remaining mindful of the real consequences it represents for hundreds of thousands of lives in America and across the world. Our use of this data is not meant to be insensitive; rather, it is intended to highlight how severe the situation can become. 

# Data Processing

All data processing will be done in Python within a Jupyter notebook. The data obtained from the API, the CSV files, and web scraping will all be loaded into the notebook and converted from JSON, dictionaries, and/or lists into pandas data frames for processing and analysis.

We will then narrow the data frame to a select number of countries and to the data relating to the number of outbreaks and the number of deaths. 
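A minimal sketch of that narrowing step is shown below, assuming the column names of the daily report displayed earlier. The three Chinese rows come from that output; the "Exampleland" row and the `countries_of_interest` list are made up for illustration.

```python
import numpy as np
import pandas as pd

# A few rows in the shape of the daily report; the "Exampleland" row is hypothetical
report = pd.DataFrame({
    "Province/State": ["Anhui", "Beijing", "Chongqing", None],
    "Country/Region": ["Mainland China", "Mainland China", "Mainland China", "Exampleland"],
    "Confirmed": [1.0, 14.0, 6.0, 3.0],
    "Deaths": [np.nan, np.nan, np.nan, 1.0],
})

countries_of_interest = ["Mainland China"]  # hypothetical selection

# Keep only the selected countries
selected = report[report["Country/Region"].isin(countries_of_interest)]

# Collapse provinces into one row per country; missing counts are treated as zero by sum()
totals = selected.groupby("Country/Region")[["Confirmed", "Deaths"]].sum()
print(totals)
```

Treating missing counts as zero is a modeling choice; if a missing value means "not reported", it may be better to keep it as `NaN` by passing `min_count=1` to `sum()`.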

Latitude and longitude data for the outbreaks in the United States will be used to create a geospatial plot of the predicted and actual number of cases in the US. 

During data cleaning, the formatting of the date and province/state columns had to be modified because the 'date' and 'province' strings appeared in different formats across the files in the GitHub repository. 
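The harmonization can be sketched as follows. The first frame mirrors the layout of the January 22 report shown earlier. The second layout (`Province_State`, `Country_Region`, `Last_Update`) and its values are assumptions made for illustration, as are the target column names and the helper `normalize`.

```python
import pandas as pd

# Two tiny frames mimicking an older and a newer daily-report layout
old = pd.DataFrame({
    "Province/State": ["Anhui"],
    "Country/Region": ["Mainland China"],
    "Last Update": ["1/22/2020 17:00"],
    "Confirmed": [1.0],
})
new = pd.DataFrame({
    "Province_State": ["Utah"],
    "Country_Region": ["US"],
    "Last_Update": ["2020-03-22 23:45:00"],
    "Confirmed": [100.0],  # made-up value
})

# Map every known spelling of a column to one common name
RENAMES = {
    "Province/State": "province", "Province_State": "province",
    "Country/Region": "country", "Country_Region": "country",
    "Last Update": "date", "Last_Update": "date",
}

def normalize(frame):
    """Rename columns to one scheme and parse the timestamp strings into datetimes."""
    out = frame.rename(columns=RENAMES)
    out["date"] = pd.to_datetime(out["date"])
    return out

combined = pd.concat([normalize(old), normalize(new)], ignore_index=True)
print(combined.dtypes)
```

Parsing the dates file by file, before concatenation, avoids mixing two string formats in a single `pd.to_datetime` call.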