## EDA & Visualization using COVID-19 Dataset

- COVID-19 Dataset
    - https://github.com/CSSEGISandData/COVID-19
    - Source: open data published by the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE)
- Visualization
    - https://app.flourish.studio/login
    ![COVID19Url](https://miro.medium.com/max/1400/1*98aCl7DGaFaKJK191z0dLw.gif)
   
- The chart needs three pieces of data per country
    - Country name
    - Flag image
    - Confirmed cases

<table>
<thead>
<tr>
<th>B</th>
<th>C</th>
<th>D</th>
<th>E</th>
<th>F</th>
<th>G</th>
<th>H</th>
</tr>
</thead>
<tbody>
<tr>
<th>Country_Region</th>
<th>Image URL</th>
<th>1/22/20</th>
<th>1/23/20</th>
<th>1/24/20</th>
<th>1/25/20</th>
<th>1/26/20</th>
</tr>
<tr>
<td>Afghanistan</td>
<td>https://www.countryflags.io/AF/flat/64.png</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>Albania</td>
<td>https://www.countryflags.io/AL/flat/64.png</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>Algeria</td>
<td>https://www.countryflags.io/DZ/flat/64.png</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
</tbody>
</table>
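
As a rough illustration of how the Image URL column could be filled in, the sketch below builds a flag URL from a country name. The URL pattern is the one visible in the Image URL column; the `ISO_ALPHA2` lookup and the `flag_url` helper are hypothetical names introduced here, and the lookup lists only three ISO 3166-1 alpha-2 codes rather than a complete mapping.

```python
# Hypothetical name-to-code lookup (ISO 3166-1 alpha-2); deliberately incomplete.
ISO_ALPHA2 = {
    "Afghanistan": "AF",
    "Albania": "AL",
    "Algeria": "DZ",
}


def flag_url(country, size=64):
    """Return a flag image URL for a country, or None if the country is not in the lookup."""
    code = ISO_ALPHA2.get(country)
    if code is None:
        return None
    # URL pattern copied from the Image URL column of the target table.
    return f"https://www.countryflags.io/{code}/flat/{size}.png"


print(flag_url("Algeria"))
```

Returning `None` for unknown names makes missing flags easy to spot before the data is uploaded to the chart tool.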

<div class="alert alert-block" style="border: 1px solid #FFB300;background-color:#F9FBE7;">
<font size="4em" style="font-weight:bold;color:#3f8dbf;">2. Pre-processing the Data</font><br>
</div>

### 2-1. How to Clean Messy Pandas Column Names
### 2-2. Delete Missing Values and Change Datatype

In [1]:
import pandas as pd

PATH = "COVID-19-master/csse_covid_19_data/csse_covid_19_daily_reports/"
doc = pd.read_csv(PATH + "01-22-2020.csv", encoding='utf-8-sig')
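try:
    # Layout with underscore-separated column names
    doc = doc[['Province_State', 'Country_Region', 'Confirmed']]
except KeyError:
    # Fallback: layout with slash-separated column names; select them, then rename
    doc = doc[['Province/State', 'Country/Region', 'Confirmed']]
    doc.columns = ['Province_State', 'Country_Region', 'Confirmed']

doc = doc.dropna(subset=['Confirmed'])  # Drop rows with a missing 'Confirmed' value
doc = doc.astype({'Confirmed': 'int64'})  # Case counts are whole numbers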

doc.head()

,Province_State,Country_Region,Confirmed
0,Anhui,Mainland China,1
1,Beijing,Mainland China,14
2,Chongqing,Mainland China,6
3,Fujian,Mainland China,1
5,Guangdong,Mainland China,26
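
The same cleaning steps can also be written without `try`/`except` by renaming the columns first. This is only a sketch: `COLUMN_ALIASES` and `load_confirmed` are names made up for this example, and the toy frame holds illustrative values rather than real report data.

```python
import pandas as pd

# Slash-separated header -> underscore-separated header.
COLUMN_ALIASES = {
    "Province/State": "Province_State",
    "Country/Region": "Country_Region",
}


def load_confirmed(df):
    """Normalize column names, keep three columns, drop missing counts, cast to int."""
    df = df.rename(columns=COLUMN_ALIASES)  # keys that are absent are simply ignored
    df = df[["Province_State", "Country_Region", "Confirmed"]]
    df = df.dropna(subset=["Confirmed"])
    return df.astype({"Confirmed": "int64"})


# Toy frame in the slash-separated layout (illustrative values only).
old_style = pd.DataFrame({
    "Province/State": ["A", "B"],
    "Country/Region": ["X", "X"],
    "Confirmed": [3.0, None],
})
clean = load_confirmed(old_style)
print(clean)
```

Because `rename` ignores keys that are not present, the same function accepts both layouts.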


### 2-3. Add Country Flags to the Bar Chart 

#### Make Consistent Names for "Country_Region"
>- Create a JSON file that maps each country name variant to a single reference name
>    - File name: country_convert.json
>- Apply the new country names row by row with apply()

In [2]:
import json

with open("COVID-19-master/csse_covid_19_data/country_convert.json", 'r', encoding='utf-8-sig') as json_file:
    json_data = json.load(json_file)

In [3]:
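def func(df_arg):
    # df_arg is a single row; swap in the reference name if the mapping has an entry
    if df_arg['Country_Region'] in json_data:
        df_arg['Country_Region'] = json_data[df_arg['Country_Region']]
    return df_arg

doc = doc.apply(func, axis=1)  # axis=1 passes each row to func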
doc.head()

,Province_State,Country_Region,Confirmed
0,Anhui,China,1
1,Beijing,China,14
2,Chongqing,China,6
3,Fujian,China,1
5,Guangdong,China,26
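
Row-wise `apply()` works, but the same renaming can be done on the whole column at once. The sketch below uses `Series.replace` with a dictionary; `name_map` is a hypothetical stand-in for the contents of country_convert.json, whose actual entries are not shown in this notebook.

```python
import pandas as pd

# Stand-in for the mapping loaded from country_convert.json (hypothetical entries).
name_map = {"Mainland China": "China", "South Korea": "Korea, South"}

df = pd.DataFrame({
    "Country_Region": ["Mainland China", "Japan", "South Korea"],
    "Confirmed": [1, 2, 1],
})

# Series.replace swaps values found in the mapping and leaves all others unchanged.
df["Country_Region"] = df["Country_Region"].replace(name_map)
print(df["Country_Region"].tolist())
```

Names without an entry in the mapping, such as 'Japan' here, pass through untouched, which matches the behavior of the `if ... in json_data` check.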


### 2-4. Convert the File Name to a Date Label
>- From: 01-22-2020
>- To: 1/22/2020

In [4]:
date = "01-22-2020.csv"
# Drop the extension, strip the leading zero of the month, and swap '-' for '/'
date_column = date.split('.')[0].lstrip('0').replace('-', '/')
date_column

'1/22/2020'
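
Note that `lstrip('0')` only removes zeros at the very start of the string, so a file such as "02-05-2020.csv" would become '2/05/2020' with the day still zero-padded. If unpadded days are wanted as well, one option is to parse the date explicitly. The helper name `date_label` below is made up for this sketch.

```python
from datetime import datetime


def date_label(file_name):
    """Convert 'MM-DD-YYYY.csv' into 'M/D/YYYY' with no zero padding."""
    stem = file_name.rsplit(".", 1)[0]             # drop the extension
    parsed = datetime.strptime(stem, "%m-%d-%Y")   # raises ValueError on a malformed name
    return f"{parsed.month}/{parsed.day}/{parsed.year}"


print(date_label("01-22-2020.csv"))
print(date_label("02-05-2020.csv"))
```

Parsing also catches malformed file names early, since `strptime` raises a `ValueError` instead of producing a wrong label.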

In [5]:
doc.columns

Index(['Province_State', 'Country_Region', 'Confirmed'], dtype='object')

>- Rename the 'Confirmed' column to the date label

In [7]:
doc.columns = ['Province_State', 'Country_Region', date_column]
doc.columns

Index(['Province_State', 'Country_Region', '1/22/2020'], dtype='object')

In [8]:
doc.head()

,Province_State,Country_Region,1/22/2020
0,Anhui,China,1
1,Beijing,China,14
2,Chongqing,China,6
3,Fujian,China,1
5,Guangdong,China,26
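
Assigning a full list to `doc.columns` depends on the column order staying the same. A slightly more defensive variant, sketched here on a two-row sample, renames only the 'Confirmed' column by name with `DataFrame.rename`.

```python
import pandas as pd

# Two rows taken from the output above, rebuilt by hand for a self-contained example.
df = pd.DataFrame({
    "Province_State": ["Anhui", "Beijing"],
    "Country_Region": ["China", "China"],
    "Confirmed": [1, 14],
})
date_column = "1/22/2020"

# rename targets the column by name and returns a new DataFrame.
df = df.rename(columns={"Confirmed": date_column})
print(list(df.columns))
```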


### 2-5. Group Dataframe
>- Group by Country_Region
>    - e.g., combine the rows for 'Anhui', 'Beijing', and 'Chongqing' into a single 'China' row
>- Use groupby(column).sum() to add up the values in each group

In [9]:
student = {
    'Gender': ['M', 'M', 'F', 'F'],
    'Name': ['Ben', 'Will', 'Danielle', 'Melissa'],
    'Math': [85, 80, 90, 65],
    'Literature': [65, 80, 78, 95]        
}

df_test = pd.DataFrame(student)
df_test

,Gender,Name,Math,Literature
0,M,Ben,85,65
1,M,Will,80,80
2,F,Danielle,90,78
3,F,Melissa,65,95


In [10]:
df_test.groupby('Gender').mean(numeric_only=True)  # skip the text column 'Name'

Gender,Math,Literature
F,77.5,86.5
M,82.5,72.5


In [11]:
df_test.groupby('Gender').sum(numeric_only=True)  # skip the text column 'Name'

Gender,Math,Literature
F,155,173
M,165,145
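
Depending on the pandas version, text columns such as 'Name' may or may not be dropped automatically during aggregation. A version-independent way to get the same numbers, sketched below, is to select the numeric columns before aggregating. The example also computes the mean and the sum in one call with `agg`.

```python
import pandas as pd

df_test = pd.DataFrame({
    "Gender": ["M", "M", "F", "F"],
    "Name": ["Ben", "Will", "Danielle", "Melissa"],
    "Math": [85, 80, 90, 65],
    "Literature": [65, 80, 78, 95],
})

# Select the numeric columns first so the text column 'Name' is never aggregated.
scores = df_test.groupby("Gender")[["Math", "Literature"]]

# agg with a list of function names yields one sub-column per (column, function) pair.
summary = scores.agg(["mean", "sum"])
print(summary)
```

The result has two-level column labels, so a single value is read with a tuple, for example `summary.loc["F", ("Math", "mean")]`.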


### <font color='red'>Group Confirmed Cases by Country_Region</font>

In [12]:
doc.head()

,Province_State,Country_Region,1/22/2020
0,Anhui,China,1
1,Beijing,China,14
2,Chongqing,China,6
3,Fujian,China,1
5,Guangdong,China,26


In [13]:
doc.groupby('Country_Region').sum(numeric_only=True)  # sum only the case counts, not 'Province_State'

Country_Region,1/22/2020
China,548
Japan,2
"Korea, South",1
Taiwan,1
Thailand,2
US,1
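
To reach the wide layout shown at the top (one row per country, one column per date), the per-day totals have to be lined up side by side. The following is a minimal sketch of one way to do that with `pd.concat`, not necessarily the approach used in the rest of this notebook. `country_totals` is a hypothetical helper, and the two small frames contain illustrative values, not real counts.

```python
import pandas as pd


def country_totals(df, date_column):
    """Sum confirmed cases per country and name the resulting Series after the date."""
    totals = df.groupby("Country_Region")["Confirmed"].sum()
    return totals.rename(date_column)


# Two toy daily reports (illustrative values only).
day1 = pd.DataFrame({"Country_Region": ["China", "China", "Japan"], "Confirmed": [1, 14, 2]})
day2 = pd.DataFrame({"Country_Region": ["China", "Japan", "US"], "Confirmed": [20, 2, 1]})

# concat with axis=1 aligns the Series on country name; a country missing on a day becomes NaN.
wide = pd.concat(
    [country_totals(day1, "1/22/2020"), country_totals(day2, "1/23/2020")],
    axis=1,
)
wide = wide.fillna(0).astype("int64")  # treat "not reported that day" as 0 cases
wide.index.name = "Country_Region"
print(wide)
```

Filling gaps with 0 is an assumption that suits early dates, when a country has not yet appeared in the reports; other choices, such as carrying the previous value forward, may fit later dates better.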
