## EDA & Visualization using COVID-19 Dataset

- COVID-19 Dataset
    - https://github.com/CSSEGISandData/COVID-19
    - Source: open data published by the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE)
- Visualization
    - https://app.flourish.studio/login

![COVID19Url](https://miro.medium.com/max/1400/1*98aCl7DGaFaKJK191z0dLw.gif)

- We need three pieces of data
    - Country Name
    - Flag image
    - Confirmed cases 



<table>
<thead>
<tr>
<th>B</th>
<th>C</th>
<th>D</th>
<th>E</th>
<th>F</th>
<th>G</th>
<th>H</th>
</tr>
</thead>
<tbody>
<tr>
<th>Country_Region</th>
<th>Image URL</th>
<th>1/22/20</th>
<th>1/23/20</th>
<th>1/24/20</th>
<th>1/25/20</th>
<th>1/26/20</th>
</tr>
<tr>
<th>Afghanistan</th>
<th>https://www.countryflags.io/AO/flat/64.png</th>
<th>0</th>
<th>0</th>
<th>0</th>
<th>0</th>
<th>0</th>
</tr>
<tr>
<th>Albania</th>
<th>https://www.countryflags.io/BI/flat/64.png</th>
<th>0</th>
<th>0</th>
<th>0</th>
<th>0</th>
<th>0</th>
</tr>
<tr>
<th>Algeria</th>
<th>https://www.countryflags.io/BJ/flat/64.png</th>
<th>0</th>
<th>0</th>
<th>0</th>
<th>0</th>
<th>0</th>
</tr>
</tbody>
</table>

<div class="alert alert-block" style="border: 1px solid #FFB300;background-color:#F9FBE7;">
<font size="4em" style="font-weight:bold;color:#3f8dbf;">1. Explore Raw Dataset</font><br>
</div>

- Understand the dataset
    - Column names differ between the daily report files

In [1]:
import pandas as pd

- Example: April 1 dataset
    - The country column is named `Country_Region`

In [2]:
PATH = "COVID-19-master/csse_covid_19_data/csse_covid_19_daily_reports/"
doc = pd.read_csv(PATH + "04-01-2020.csv", encoding='utf-8-sig')

In [3]:
doc.head()

Unnamed: 0,FIPS,Admin2,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Deaths,Recovered,Active,Combined_Key
0,45001.0,Abbeville,South Carolina,US,2020-04-01 21:58:49,34.223334,-82.461707,4,0,0,0,"Abbeville, South Carolina, US"
1,22001.0,Acadia,Louisiana,US,2020-04-01 21:58:49,30.295065,-92.414197,47,1,0,0,"Acadia, Louisiana, US"
2,51001.0,Accomack,Virginia,US,2020-04-01 21:58:49,37.767072,-75.632346,7,0,0,0,"Accomack, Virginia, US"
3,16001.0,Ada,Idaho,US,2020-04-01 21:58:49,43.452658,-116.241552,195,3,0,0,"Ada, Idaho, US"
4,19001.0,Adair,Iowa,US,2020-04-01 21:58:49,41.330756,-94.471059,1,0,0,0,"Adair, Iowa, US"


- Example: March 1 dataset
    - The country column is named `Country/Region`

In [4]:
doc2 = pd.read_csv(PATH + "03-01-2020.csv", encoding='utf-8-sig')
doc2.head()

Unnamed: 0,Province/State,Country/Region,Last Update,Confirmed,Deaths,Recovered,Latitude,Longitude
0,Hubei,Mainland China,2020-03-01T10:13:19,66907,2761,31536,30.9756,112.2707
1,,South Korea,2020-03-01T23:43:03,3736,17,30,36.0,128.0
2,,Italy,2020-03-01T23:23:02,1694,34,83,43.0,12.0
3,Guangdong,Mainland China,2020-03-01T14:13:18,1349,7,1016,23.3417,113.4244
4,Henan,Mainland China,2020-03-01T14:13:18,1272,22,1198,33.882,113.614
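A quick way to see exactly which column names changed between two daily reports is to compare their column sets. The sketch below uses two empty toy frames whose headers are a subset of the March 1 and April 1 headers shown above; the variable names `march` and `april` are illustrative only.

```python
import pandas as pd

# Toy frames that mimic the two header layouts (no rows are needed for this check).
march = pd.DataFrame(columns=["Province/State", "Country/Region", "Last Update", "Confirmed"])
april = pd.DataFrame(columns=["FIPS", "Admin2", "Province_State", "Country_Region",
                              "Last_Update", "Confirmed"])

# Set arithmetic on the column labels shows what differs and what is shared.
only_in_march = sorted(set(march.columns) - set(april.columns))
only_in_april = sorted(set(april.columns) - set(march.columns))
shared = sorted(set(march.columns) & set(april.columns))

print("Only in March:", only_in_march)
print("Only in April:", only_in_april)
print("Shared:", shared)
```

With the real files, the same comparison can be run on `doc.columns` and `doc2.columns`.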


<div class="alert alert-block" style="border: 1px solid #FFB300;background-color:#F9FBE7;">
<font size="4em" style="font-weight:bold;color:#3f8dbf;">2. Pre-processing the Data</font><br>
</div>

### 2-1. How to Clean Messy Pandas Column Names
>- e.g., `Country_Region` vs. `Country/Region`
>- Use <b>try</b> and <b>except</b> to handle both naming schemes

In [5]:
doc = pd.read_csv(PATH + "01-22-2020.csv", encoding='utf-8-sig')
doc.head()

Unnamed: 0,Province/State,Country/Region,Last Update,Confirmed,Deaths,Recovered
0,Anhui,Mainland China,1/22/2020 17:00,1.0,,
1,Beijing,Mainland China,1/22/2020 17:00,14.0,,
2,Chongqing,Mainland China,1/22/2020 17:00,6.0,,
3,Fujian,Mainland China,1/22/2020 17:00,1.0,,
4,Gansu,Mainland China,1/22/2020 17:00,,,


In [7]:
doc = pd.read_csv(PATH + "01-22-2020.csv", encoding='utf-8-sig')
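try:
    # Newer files use underscore-style column names
    doc = doc[['Province_State', 'Country_Region', 'Confirmed']]
except KeyError:
    # Older files use slash-style names; select them, then rename to the newer style
    doc = doc[['Province/State', 'Country/Region', 'Confirmed']]
    doc.columns = ['Province_State', 'Country_Region', 'Confirmed']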

doc.head()

Unnamed: 0,Province_State,Country_Region,Confirmed
0,Anhui,Mainland China,1.0
1,Beijing,Mainland China,14.0
2,Chongqing,Mainland China,6.0
3,Fujian,Mainland China,1.0
4,Gansu,Mainland China,
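As an alternative to try/except, the same normalization can be sketched with `DataFrame.rename`, which by default ignores mapping keys that are not present in the frame. The mapping `OLD_TO_NEW` and the helper `normalize_columns` are hypothetical names introduced for this illustration, and the toy rows only mimic the two layouts.

```python
import pandas as pd

# Hypothetical mapping from the older slash-style names to the newer underscore-style names.
OLD_TO_NEW = {"Province/State": "Province_State", "Country/Region": "Country_Region"}

def normalize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Rename old-style columns if present, then keep only the three needed columns."""
    renamed = df.rename(columns=OLD_TO_NEW)  # keys missing from df are skipped silently
    return renamed[["Province_State", "Country_Region", "Confirmed"]]

# Toy rows in each layout.
old_style = pd.DataFrame({"Province/State": ["Anhui"],
                          "Country/Region": ["Mainland China"],
                          "Confirmed": [1.0]})
new_style = pd.DataFrame({"Province_State": ["South Carolina"],
                          "Country_Region": ["US"],
                          "Confirmed": [4]})

print(list(normalize_columns(old_style).columns))
print(list(normalize_columns(new_style).columns))
```

Both calls return the same three column names, so later steps do not need to know which layout a file used.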


### 2-2. Delete Missing Values and Change Datatype
>- Delete rows with missing values: `df.dropna(subset=['column'])`
>- Change a column's datatype: `df.astype({'column': 'datatype'})`

In [8]:
PATH = "COVID-19-master/csse_covid_19_data/csse_covid_19_daily_reports/"
doc = pd.read_csv(PATH + "01-22-2020.csv", encoding='utf-8-sig')
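try:
    # Newer files use underscore-style column names
    doc = doc[['Province_State', 'Country_Region', 'Confirmed']]
except KeyError:
    # Older files use slash-style names; select them, then rename to the newer style
    doc = doc[['Province/State', 'Country/Region', 'Confirmed']]
    doc.columns = ['Province_State', 'Country_Region', 'Confirmed']

doc = doc.dropna(subset=['Confirmed'])    # Drop rows where 'Confirmed' is missing
doc = doc.astype({'Confirmed': 'int64'})  # Cast counts from float to integer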

doc.head()

Unnamed: 0,Province_State,Country_Region,Confirmed
0,Anhui,Mainland China,1
1,Beijing,Mainland China,14
2,Chongqing,Mainland China,6
3,Fujian,Mainland China,1
5,Guangdong,Mainland China,26
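The order of the two steps matters: a column containing NaN is stored as float, and casting it to `int64` before dropping the missing rows raises an error. The sketch below demonstrates this on a small toy frame (the values are illustrative). The `reset_index` call is an optional extra that the notebook itself does not use; it closes index gaps such as the jump from 3 to 5 in the output above.

```python
import pandas as pd

# Toy frame with one missing 'Confirmed' value.
toy = pd.DataFrame({
    "Country_Region": ["Mainland China", "Mainland China", "Mainland China"],
    "Confirmed": [1.0, None, 26.0],
})

# Casting while NaN is still present fails, because NaN has no integer representation.
try:
    toy.astype({"Confirmed": "int64"})
    cast_failed = False
except ValueError:
    cast_failed = True

# Dropping the missing rows first makes the cast safe.
cleaned = toy.dropna(subset=["Confirmed"]).astype({"Confirmed": "int64"})

# Optional: renumber the rows so the index has no gaps.
cleaned = cleaned.reset_index(drop=True)

print(cast_failed)
print(cleaned)
```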
