# EDA using Pandas

## COVID-19 Dataset
* https://github.com/CSSEGISandData/COVID-19
* Source: Open data from the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE)

<div class="alert alert-block" style="border: 1px solid #FFB300;background-color:#F9FBE7;">
<font size="4em" style="font-weight:bold;color:#3f8dbf;">1. Explore Dataset</font><br>
</div>

- head(): Return the first n rows of the DataFrame (5 by default).

In [1]:
import pandas as pd
PATH = "COVID-19-master/csse_covid_19_data/csse_covid_19_daily_reports/"
covid = pd.read_csv(PATH + "04-01-2020.csv", encoding='utf-8-sig')

In [2]:
covid.head()

Unnamed: 0,FIPS,Admin2,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Deaths,Recovered,Active,Combined_Key
0,45001.0,Abbeville,South Carolina,US,2020-04-01 21:58:49,34.223334,-82.461707,4,0,0,0,"Abbeville, South Carolina, US"
1,22001.0,Acadia,Louisiana,US,2020-04-01 21:58:49,30.295065,-92.414197,47,1,0,0,"Acadia, Louisiana, US"
2,51001.0,Accomack,Virginia,US,2020-04-01 21:58:49,37.767072,-75.632346,7,0,0,0,"Accomack, Virginia, US"
3,16001.0,Ada,Idaho,US,2020-04-01 21:58:49,43.452658,-116.241552,195,3,0,0,"Ada, Idaho, US"
4,19001.0,Adair,Iowa,US,2020-04-01 21:58:49,41.330756,-94.471059,1,0,0,0,"Adair, Iowa, US"


<div class="alert alert-block" style="border: 1px solid #FFB300;background-color:#F9FBE7;">
<font size="4em" style="font-weight:bold;color:#3f8dbf;">2. Explore Series</font><br>
</div>

### 2-1) Get Series
- Explore one column of interest

In [3]:
country=covid['Country_Region']
country.head()

0    US
1    US
2    US
3    US
4    US
Name: Country_Region, dtype: object

### 2-2) Explore Series further
- size: Return the number of observations in the Series.
- count(): Return the number of non-missing values in the Series.
- unique(): Return unique values in the Series.
- value_counts(): Return a Series of counts of unique values.

See more: https://pandas.pydata.org/docs/reference/series.html

In [4]:
print(country.size, country.count())

2483 2483


In [5]:
print(country.unique(), len(country.unique()))

['US' 'Canada' 'United Kingdom' 'China' 'Netherlands' 'Australia'
 'Denmark' 'France' 'Afghanistan' 'Albania' 'Algeria' 'Andorra' 'Angola'
 'Antigua and Barbuda' 'Argentina' 'Armenia' 'Austria' 'Azerbaijan'
 'Bahamas' 'Bahrain' 'Bangladesh' 'Barbados' 'Belarus' 'Belgium' 'Belize'
 'Benin' 'Bhutan' 'Bolivia' 'Bosnia and Herzegovina' 'Botswana' 'Brazil'
 'Brunei' 'Bulgaria' 'Burkina Faso' 'Burma' 'Burundi' 'Cabo Verde'
 'Cambodia' 'Cameroon' 'Central African Republic' 'Chad' 'Chile'
 'Colombia' 'Congo (Brazzaville)' 'Congo (Kinshasa)' 'Costa Rica'
 "Cote d'Ivoire" 'Croatia' 'Cuba' 'Cyprus' 'Czechia' 'Diamond Princess'
 'Djibouti' 'Dominica' 'Dominican Republic' 'Ecuador' 'Egypt'
 'El Salvador' 'Equatorial Guinea' 'Eritrea' 'Estonia' 'Eswatini'
 'Ethiopia' 'Fiji' 'Finland' 'Gabon' 'Gambia' 'Georgia' 'Germany' 'Ghana'
 'Greece' 'Grenada' 'Guatemala' 'Guinea' 'Guinea-Bissau' 'Guyana' 'Haiti'
 'Holy See' 'Honduras' 'Hungary' 'Iceland' 'India' 'Indonesia' 'Iran'
 'Iraq' 'Ireland' 'Israel' 'It

In [7]:
country.value_counts()

US                2228
China               33
Canada              15
United Kingdom      10
France              10
                  ... 
Gabon                1
Gambia               1
Georgia              1
Germany              1
Zimbabwe             1
Name: Country_Region, Length: 180, dtype: int64
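
In the output above, `size` and `count()` are equal because `Country_Region` has no missing values. The following is a minimal sketch on a made-up Series (not taken from the dataset) that shows how the two differ once a value is missing, and how `value_counts()` treats that value.

```python
import pandas as pd

# Hypothetical toy data, not taken from the COVID-19 dataset
s = pd.Series(["US", "US", None, "Canada"])

print(s.size)       # total number of entries, including the missing one
print(s.count())    # number of non-missing entries
print(s.unique())   # distinct values; the missing value appears as its own entry

# value_counts() skips missing values unless dropna=False is passed
print(s.value_counts())
print(s.value_counts(dropna=False))
```

Comparing `size` with `count()` is a quick way to check whether a column contains missing values at all.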

<div class="alert alert-block" style="border: 1px solid #FFB300;background-color:#F9FBE7;">
<font size="4em" style="font-weight:bold;color:#3f8dbf;">3. Explore Subset of Dataframe</font><br>
</div>

### 3-1) Get Subset of Dataframe 1
- e.g., df[['column1', 'column2', 'column3']]

In [12]:
covid_stat = covid[['Confirmed', 'Deaths', 'Recovered']]
covid_stat.head()

Unnamed: 0,Confirmed,Deaths,Recovered
0,4,0,0
1,47,1,0
2,7,0,0
3,195,3,0
4,1,0,0
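
As a small illustration of the bracket syntax, the sketch below uses a two-row frame (values copied from the first two rows shown above; the variable names are made up) to show that single brackets return a Series while a list of column names returns a DataFrame.

```python
import pandas as pd

# Small stand-in frame; the variable names here are illustrative only
df = pd.DataFrame({"Confirmed": [4, 47], "Deaths": [0, 1], "Recovered": [0, 0]})

one_col = df["Confirmed"]           # single brackets with one name -> Series
sub = df[["Confirmed", "Deaths"]]   # a list of names -> DataFrame

print(type(one_col).__name__, type(sub).__name__)
print(sub.shape)                    # (rows, columns) of the subset
```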


### 3-2) Get Subset of Dataframe 2
- e.g., df[condition], where condition is a Boolean Series with one True/False value per row

In [8]:
PATH = "COVID-19-master/csse_covid_19_data/csse_covid_19_daily_reports/"
covid = pd.read_csv(PATH + "04-01-2020.csv", encoding='utf-8-sig')

In [9]:
covid_US = covid[covid['Country_Region'] == 'US']
covid_US.head()

Unnamed: 0,FIPS,Admin2,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Deaths,Recovered,Active,Combined_Key
0,45001.0,Abbeville,South Carolina,US,2020-04-01 21:58:49,34.223334,-82.461707,4,0,0,0,"Abbeville, South Carolina, US"
1,22001.0,Acadia,Louisiana,US,2020-04-01 21:58:49,30.295065,-92.414197,47,1,0,0,"Acadia, Louisiana, US"
2,51001.0,Accomack,Virginia,US,2020-04-01 21:58:49,37.767072,-75.632346,7,0,0,0,"Accomack, Virginia, US"
3,16001.0,Ada,Idaho,US,2020-04-01 21:58:49,43.452658,-116.241552,195,3,0,0,"Ada, Idaho, US"
4,19001.0,Adair,Iowa,US,2020-04-01 21:58:49,41.330756,-94.471059,1,0,0,0,"Adair, Iowa, US"
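
The same Boolean-mask pattern extends to several conditions. The sketch below runs on a made-up frame (hypothetical values, not from the dataset) and assumes the usual pandas operators: `&` for and, `|` for or, `~` for not, with each condition wrapped in its own parentheses.

```python
import pandas as pd

# Hypothetical toy data, not taken from the COVID-19 dataset
df = pd.DataFrame({
    "Country_Region": ["US", "US", "Canada", "China"],
    "Confirmed": [4, 195, 10, 0],
})

# Combine conditions with & and parenthesize each one
us_large = df[(df["Country_Region"] == "US") & (df["Confirmed"] > 100)]

# isin() is a compact alternative to chaining several == comparisons with |
north_america = df[df["Country_Region"].isin(["US", "Canada"])]

print(us_large)
print(north_america)
```

Python's `and` / `or` keywords do not work element-wise on a Series, which is why the bitwise operators are used here.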


<div class="alert alert-block" style="border: 1px solid #FFB300;background-color:#F9FBE7;">
<font size="4em" style="font-weight:bold;color:#3f8dbf;">4. Dealing with Missing Data</font><br>
</div>

### 4-1) Detecting Missing Values
- <b>Detect Missing Values</b>
    - isnull(): Return a Boolean mask that is True wherever a value is missing
    - sum(): Add up the mask column by column, giving the number of missing values per column
    - e.g., isnull().sum()

In [15]:
PATH = "COVID-19-master/csse_covid_19_data/csse_covid_19_daily_reports/"
covid_Jan22 = pd.read_csv(PATH + "01-22-2020.csv", encoding='utf-8-sig')
covid_Jan22.isnull().sum()

Province/State     3
Country/Region     0
Last Update        0
Confirmed          9
Deaths            37
Recovered         37
dtype: int64
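
A few related summaries can be built from the same Boolean mask. The sketch below uses a made-up three-row frame (hypothetical values) to show the per-column count, the per-column share, and the number of rows with at least one missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the COVID-19 dataset
df = pd.DataFrame({
    "Province/State": ["Anhui", "Gansu", None],
    "Confirmed": [1.0, np.nan, 2.0],
    "Deaths": [np.nan, np.nan, np.nan],
})

missing_per_column = df.isnull().sum()    # number of missing values per column
missing_share = df.isnull().mean()        # fraction of missing values per column
rows_with_missing = df.isnull().any(axis=1).sum()  # rows with at least one gap

print(missing_per_column)
print(missing_share)
print(rows_with_missing)
```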

### 4-2) Dealing with Missing Values
>1) dropna()<br>
>2) dropna() with subset<br>
>3) fillna()<br>
    - Variant: replace with a different value per column

>- dropna(): Remove rows that contain at least one missing value

In [13]:
covid_Jan22 = covid_Jan22.dropna()
covid_Jan22.head()

Unnamed: 0,Province/State,Country/Region,Last Update,Confirmed,Deaths,Recovered
13,Hubei,Mainland China,1/22/2020 17:00,444.0,17.0,28.0


>- subset: Restrict dropna() to certain columns, so a row is removed only if one of those columns is missing

In [17]:
PATH = "COVID-19-master/csse_covid_19_data/csse_covid_19_daily_reports/"
covid_Jan22 = pd.read_csv(PATH + "01-22-2020.csv", encoding='utf-8-sig')
covid_Jan22 = covid_Jan22.dropna(subset=['Confirmed'])
covid_Jan22.head()

Unnamed: 0,Province/State,Country/Region,Last Update,Confirmed,Deaths,Recovered
0,Anhui,Mainland China,1/22/2020 17:00,1.0,,
1,Beijing,Mainland China,1/22/2020 17:00,14.0,,
2,Chongqing,Mainland China,1/22/2020 17:00,6.0,,
3,Fujian,Mainland China,1/22/2020 17:00,1.0,,
5,Guangdong,Mainland China,1/22/2020 17:00,26.0,,


>- fillna(): Replace missing values

In [19]:
covid_Jan22 = pd.read_csv(PATH + "01-22-2020.csv", encoding='utf-8-sig')
covid_Jan22 = covid_Jan22.fillna(0)
covid_Jan22.head()

Unnamed: 0,Province/State,Country/Region,Last Update,Confirmed,Deaths,Recovered
0,Anhui,Mainland China,1/22/2020 17:00,1.0,0.0,0.0
1,Beijing,Mainland China,1/22/2020 17:00,14.0,0.0,0.0
2,Chongqing,Mainland China,1/22/2020 17:00,6.0,0.0,0.0
3,Fujian,Mainland China,1/22/2020 17:00,1.0,0.0,0.0
4,Gansu,Mainland China,1/22/2020 17:00,0.0,0.0,0.0


>- fillna() with a dict: Replace missing values with a different value per column; columns not listed are left unchanged

In [20]:
covid_Jan22 = pd.read_csv(PATH + "01-22-2020.csv", encoding='utf-8-sig')
na_value = {'Deaths': 1, 'Recovered': 2}
covid_Jan22 = covid_Jan22.fillna(na_value)
covid_Jan22.head()

Unnamed: 0,Province/State,Country/Region,Last Update,Confirmed,Deaths,Recovered
0,Anhui,Mainland China,1/22/2020 17:00,1.0,1.0,2.0
1,Beijing,Mainland China,1/22/2020 17:00,14.0,1.0,2.0
2,Chongqing,Mainland China,1/22/2020 17:00,6.0,1.0,2.0
3,Fujian,Mainland China,1/22/2020 17:00,1.0,1.0,2.0
4,Gansu,Mainland China,1/22/2020 17:00,,1.0,2.0
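
To compare the four approaches side by side without reloading the CSV each time, here is a minimal sketch on a made-up three-row frame (hypothetical values). It relies on the fact that `dropna()` and `fillna()` return new objects by default and leave the original frame untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the COVID-19 dataset
df = pd.DataFrame({
    "Confirmed": [1.0, np.nan, 444.0],
    "Deaths": [np.nan, np.nan, 17.0],
    "Recovered": [np.nan, np.nan, 28.0],
})

dropped_any = df.dropna()                          # keep only fully populated rows
dropped_subset = df.dropna(subset=["Confirmed"])   # only Confirmed must be present
filled_zero = df.fillna(0)                         # every gap becomes 0
filled_dict = df.fillna({"Deaths": 1, "Recovered": 2})  # Confirmed is not filled

# Report the shape and the remaining number of missing values for each result
for name, result in [
    ("dropna()", dropped_any),
    ("dropna(subset)", dropped_subset),
    ("fillna(0)", filled_zero),
    ("fillna(dict)", filled_dict),
]:
    print(name, result.shape, int(result.isnull().sum().sum()))
```

Which approach is appropriate depends on the analysis: dropping rows discards data, while filling with a constant can distort sums and averages.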
