## EDA & Visualization using COVID-19 Dataset

- COVID-19 Dataset
    - https://github.com/CSSEGISandData/COVID-19
    - Source: open-source data from the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE)
- Visualization
    - https://app.flourish.studio/login
- We need three pieces of data
    - Country Name
    - Flag image
    - Confirmed cases 

<div class="alert alert-block" style="border: 1px solid #FFB300;background-color:#F9FBE7;">
<font size="4em" style="font-weight:bold;color:#3f8dbf;">2. Pre-processing the Data</font><br>
</div>

### 2-1. How to Clean Messy Pandas Column Names
>- Column names differ between daily reports, e.g., Country_Region vs Country/Region
>- Handle both spellings with <b>try</b> and <b>except</b>
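
As an alternative to try/except, the header spellings can be normalized before selecting columns. The sketch below uses a toy frame; `normalize_columns` is a hypothetical helper, not part of the notebook, and it assumes the only differences are `/` or spaces where `_` is expected.

```python
import pandas as pd

# Toy frame using the older header style ('Province/State', 'Country/Region').
old_style = pd.DataFrame({
    'Province/State': ['Anhui'],
    'Country/Region': ['Mainland China'],
    'Confirmed': [1],
})

def normalize_columns(df):
    """Hypothetical helper: return a copy whose column names use '_' instead of '/' or spaces."""
    out = df.copy()
    out.columns = out.columns.str.strip().str.replace(r'[/ ]', '_', regex=True)
    return out

normalized = normalize_columns(old_style)
print(list(normalized.columns))
```

After normalization, `normalized[['Province_State', 'Country_Region', 'Confirmed']]` works for either header style.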

### 2-2. Delete Missing Values and Change Datatype
>- delete missing values: df.dropna(subset=['column'])
>- change datatype: df.astype({'column': 'datatype'})
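
The order of the two steps matters: a column that still contains NaN cannot be cast to `int64`. A minimal sketch on toy data (the frame and its values are made up for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Country_Region': ['A', 'B', 'C'],
    'Confirmed': [1.0, np.nan, 3.0],
})

# Casting a column that contains NaN to int64 raises an error,
# so the row with the missing value has to be dropped first.
try:
    toy.astype({'Confirmed': 'int64'})
    cast_failed = False
except (ValueError, TypeError):
    cast_failed = True

clean = toy.dropna(subset=['Confirmed']).astype({'Confirmed': 'int64'})
print(cast_failed, clean['Confirmed'].tolist(), clean['Confirmed'].dtype)
```

The surviving rows keep their original index labels, which would explain a gap such as the jump from 3 to 5 in the `doc.head()` output.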

In [1]:
import pandas as pd

PATH = "COVID-19-master/csse_covid_19_data/csse_covid_19_daily_reports/"
doc = pd.read_csv(PATH + "01-22-2020.csv", encoding='utf-8-sig')
try:
    doc = doc[['Province_State', 'Country_Region', 'Confirmed']]
except KeyError:  # older daily reports use 'Province/State'-style headers
    doc = doc[['Province/State', 'Country/Region', 'Confirmed']]
    doc.columns = ['Province_State', 'Country_Region', 'Confirmed']

doc = doc.dropna(subset=['Confirmed'])  # drop rows with a missing 'Confirmed' value
doc = doc.astype({'Confirmed': 'int64'})

doc.head()

Unnamed: 0,Province_State,Country_Region,Confirmed
0,Anhui,Mainland China,1
1,Beijing,Mainland China,14
2,Chongqing,Mainland China,6
3,Fujian,Mainland China,1
5,Guangdong,Mainland China,26


### 2-3. Add Country Flags to the Bar Chart 
>- Flag images are looked up by ISO2 (two-letter) country code
    - https://www.countryflagsapi.com/

In [3]:
iso2 = pd.read_csv("COVID-19-master/csse_covid_19_data/UID_ISO_FIPS_LookUp_Table.csv", encoding='utf-8-sig')
iso2.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,UID,iso2,iso3,code3,FIPS,Admin2,Province_State,Country_Region,Lat,Long_,Combined_Key
0,0,0,,BW,,,,,,Botswana,,,Botswana
1,1,1,,BI,,,,,,Burundi,,,Burundi
2,2,2,,SL,,,,,,Sierra Leone,,,Sierra Leone
3,3,3,4.0,AF,AFG,4.0,,,,Afghanistan,33.93911,67.709953,Afghanistan
4,4,4,8.0,AL,ALB,8.0,,,,Albania,41.1533,20.1683,Albania


### Merge Two Dataframes
>- df1: confirmed cases dataframe
    - 01-22-2020.csv
>- df2: iso2 dataframe

In [11]:
doc_iso2 = pd.merge(doc, iso2, how='left', on='Country_Region')
doc_iso2.head()

Unnamed: 0.2,Province_State_x,Country_Region,Confirmed,Unnamed: 0,Unnamed: 0.1,UID,iso2,iso3,code3,FIPS,Admin2,Province_State_y,Lat,Long_,Combined_Key
0,Anhui,Mainland China,1,,,,,,,,,,,,
1,Beijing,Mainland China,14,,,,,,,,,,,,
2,Chongqing,Mainland China,6,,,,,,,,,,,,
3,Fujian,Mainland China,1,,,,,,,,,,,,
4,Guangdong,Mainland China,26,,,,,,,,,,,,


>- Check which 'Country_Region' values have a null 'iso2' after the merge

In [12]:
doc_iso2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3333 entries, 0 to 3332
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Province_State_x  3330 non-null   object 
 1   Country_Region    3333 non-null   object 
 2   Confirmed         3333 non-null   int64  
 3   Unnamed: 0        3308 non-null   float64
 4   Unnamed: 0.1      3308 non-null   float64
 5   UID               3308 non-null   float64
 6   iso2              3308 non-null   object 
 7   iso3              3308 non-null   object 
 8   code3             3308 non-null   float64
 9   FIPS              3302 non-null   float64
 10  Admin2            3246 non-null   object 
 11  Province_State_y  3305 non-null   object 
 12  Lat               3203 non-null   float64
 13  Long_             3203 non-null   float64
 14  Combined_Key      3308 non-null   object 
dtypes: float64(7), int64(1), object(7)
memory usage: 416.6+ KB
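
The merged frame has 3333 rows, which suggests that a daily-report row was repeated for every lookup-table row sharing its `Country_Region` (the lookup table also lists provinces and counties). One possible way to avoid this, sketched on toy data, is to keep a single country-level row per country before merging. The rule "country-level rows have an empty `Province_State`" is an assumption about the lookup table, not something the notebook confirms.

```python
import pandas as pd

# Toy stand-ins for the daily report and the lookup table.
cases = pd.DataFrame({
    'Country_Region': ['US', 'Japan'],
    'Confirmed': [1, 2],
})
lookup = pd.DataFrame({
    'iso2': ['US', 'US', 'US', 'JP'],
    'Province_State': [None, 'Washington', 'Illinois', None],
    'Country_Region': ['US', 'US', 'US', 'Japan'],
})

# A plain left merge repeats the 'US' row once per matching lookup row.
naive = pd.merge(cases, lookup, how='left', on='Country_Region')

# Assumption: rows with no Province_State describe the country as a whole.
country_level = (
    lookup[lookup['Province_State'].isna()][['Country_Region', 'iso2']]
    .drop_duplicates(subset='Country_Region')
)

# validate='many_to_one' makes pandas raise if the right-hand keys are not unique.
deduped = pd.merge(cases, country_level, how='left',
                   on='Country_Region', validate='many_to_one')
print(len(naive), len(deduped))
```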


- Subset of the merged dataframe where 'iso2' is null (missing value)

In [13]:
iso2_null = doc_iso2[doc_iso2['iso2'].isnull()]
iso2_null.head()

Unnamed: 0.2,Province_State_x,Country_Region,Confirmed,Unnamed: 0,Unnamed: 0.1,UID,iso2,iso3,code3,FIPS,Admin2,Province_State_y,Lat,Long_,Combined_Key
0,Anhui,Mainland China,1,,,,,,,,,,,,
1,Beijing,Mainland China,14,,,,,,,,,,,,
2,Chongqing,Mainland China,6,,,,,,,,,,,,
3,Fujian,Mainland China,1,,,,,,,,,,,,
4,Guangdong,Mainland China,26,,,,,,,,,,,,
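
To get a compact list of the names that failed to match, rather than scanning rows, `pd.merge` can flag unmatched rows with `indicator=True`. A sketch on toy data (both frames and their values are illustrative, not taken from the real files):

```python
import pandas as pd

cases = pd.DataFrame({
    'Country_Region': ['Mainland China', 'Mainland China', 'Japan', 'South Korea'],
    'Confirmed': [1, 14, 2, 1],
})
lookup = pd.DataFrame({
    'Country_Region': ['China', 'Japan', 'Korea, South'],
    'iso2': ['CN', 'JP', 'KR'],
})

# indicator=True adds a '_merge' column: 'both' for matches,
# 'left_only' for rows of `cases` with no partner in `lookup`.
merged = pd.merge(cases, lookup, how='left', on='Country_Region', indicator=True)
unmatched = (
    merged.loc[merged['_merge'] == 'left_only', 'Country_Region']
    .unique()
    .tolist()
)
print(unmatched)
```

Each name in `unmatched` is a candidate key for a name-conversion table.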


### apply() function
>- To systematically change 'Country_Region' names
>- apply(func, axis=0 or 1)
    - axis=0: func is called once per column
    - axis=1: func is called once per row
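
A small sketch of the difference between the two axes, using toy scores. With `axis=0` the function receives each column as a Series; with `axis=1` it receives each row as a Series.

```python
import pandas as pd

scores = pd.DataFrame(
    {'Math': [60, 70], 'Literature': [85, 72]},
    index=['Ben', 'Will'],
)

# axis=0: the lambda sees one column at a time, giving one total per subject.
per_subject = scores.apply(lambda col: col.sum(), axis=0)

# axis=1: the lambda sees one row at a time, giving one total per student.
per_student = scores.apply(lambda row: row.sum(), axis=1)

print(per_subject)
print(per_student)
```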

In [21]:
import pandas as pd
student_score = {
    'Math': [60, 70],
    'Literature': [85, 72]
}

student_df = pd.DataFrame(student_score, index = ['Ben', 'Will'])
student_df

Unnamed: 0,Math,Literature
Ben,60,85
Will,70,72


In [23]:
def func(df_arg):  # with axis=1, the argument is one row (a pandas Series)
    df_arg['Math'] = 80
    return df_arg

In [24]:
df_score_func = student_df.apply(func, axis=1)
df_score_func

Unnamed: 0,Math,Literature
Ben,80,85
Will,80,72


### <font color='red'>Make Consistent Names for "Country_Region"</font>
>- Create a json file as a new reference for Country Name
    - file name: country_convert.json
>- Apply new country name using apply()
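
The reference file can be as simple as a flat JSON object mapping old names to new ones. A sketch of writing and reading such a file follows; only the pair `'Mainland China' -> 'China'` is confirmed by the notebook output, the other pairs are illustrative guesses, and a temporary directory stands in for the real data folder.

```python
import json
import os
import tempfile

# Only 'Mainland China' -> 'China' is confirmed by the notebook output;
# the other pairs are illustrative guesses at what the file might contain.
country_convert = {
    'Mainland China': 'China',
    'South Korea': 'Korea, South',
    'UK': 'United Kingdom',
}

# Write to a temporary directory so the sketch does not touch real data.
path = os.path.join(tempfile.mkdtemp(), 'country_convert.json')
with open(path, 'w', encoding='utf-8') as f:
    json.dump(country_convert, f, ensure_ascii=False, indent=2)

# Read it back the same way the notebook loads its reference file.
with open(path, 'r', encoding='utf-8') as f:
    loaded = json.load(f)
print(loaded['Mainland China'])
```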

In [25]:
import pandas as pd

PATH = "COVID-19-master/csse_covid_19_data/csse_covid_19_daily_reports/"
doc = pd.read_csv(PATH + "01-22-2020.csv", encoding='utf-8-sig')
try:
    doc = doc[['Province_State', 'Country_Region', 'Confirmed']]
except KeyError:  # older daily reports use 'Province/State'-style headers
    doc = doc[['Province/State', 'Country/Region', 'Confirmed']]
    doc.columns = ['Province_State', 'Country_Region', 'Confirmed']

doc = doc.dropna(subset=['Confirmed'])  # drop rows with a missing 'Confirmed' value
doc = doc.astype({'Confirmed': 'int64'})

doc.head()

Unnamed: 0,Province_State,Country_Region,Confirmed
0,Anhui,Mainland China,1
1,Beijing,Mainland China,14
2,Chongqing,Mainland China,6
3,Fujian,Mainland China,1
5,Guangdong,Mainland China,26


In [26]:
import json

with open("COVID-19-master/csse_covid_19_data/country_convert.json", 'r', encoding='utf-8-sig') as json_file:
    json_data = json.load(json_file)
    print(json_data.keys())

dict_keys(['Mainland China', 'Macau', 'South Korea', 'Aruba', ' Azerbaijan', 'Bahamas, The', 'Cape Verde', 'Cayman Islands', 'Channel Islands', 'Curacao', 'Czech Republic', 'East Timor', 'Faroe Islands', 'French Guiana', 'Gambia, The', 'Gibraltar', 'Greenland', 'Guadeloupe', 'Guam', 'Guernsey', 'Hong Kong', 'Hong Kong SAR', 'Iran (Islamic Republic of)', 'Ivory Coast', 'Jersey', 'Macao SAR', 'Martinique', 'Mayotte', 'North Ireland', 'Palestine', 'Puerto Rico', 'Republic of Ireland', 'Republic of Korea', 'Republic of Moldova', 'Republic of the Congo', 'Reunion', 'Russian Federation', 'Saint Barthelemy', 'Saint Martin', 'St. Martin', 'Taipei and environs', 'The Bahamas', 'The Gambia', 'UK', 'Vatican City', 'Viet Nam', 'occupied Palestinian territory', 'Taiwan*', 'Malawi', 'South Sudan', 'Western Sahara', 'Namibia'])


>- Define the function to pass to apply()

In [27]:
def func(df_arg):
    if df_arg['Country_Region'] in json_data:
        df_arg['Country_Region'] = json_data[df_arg['Country_Region']]
    return df_arg

In [28]:
doc = doc.apply(func, axis=1)
doc.head()

Unnamed: 0,Province_State,Country_Region,Confirmed
0,Anhui,China,1
1,Beijing,China,14
2,Chongqing,China,6
3,Fujian,China,1
5,Guangdong,China,26
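
For a plain dictionary lookup like this one, a column-level `Series.replace` may be a simpler alternative to a row-wise `apply`. A sketch on toy data, where `mapping` is a one-entry stand-in for `json_data`:

```python
import pandas as pd

mapping = {'Mainland China': 'China'}  # stand-in for json_data

toy = pd.DataFrame({
    'Province_State': ['Anhui', 'Tokyo'],
    'Country_Region': ['Mainland China', 'Japan'],
    'Confirmed': [1, 2],
})

# Values found among the dict keys are swapped; all others are left unchanged.
toy['Country_Region'] = toy['Country_Region'].replace(mapping)
print(toy['Country_Region'].tolist())
```

This touches only the one column, so the other columns keep their dtypes.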
