# EDA Using Pandas, Part II

## COVID-19 Dataset
* https://github.com/CSSEGISandData/COVID-19
* Source: open-source data from the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE)

<div class="alert alert-block" style="border: 1px solid #FFB300;background-color:#F9FBE7;">
<font size="4em" style="font-weight:bold;color:#3f8dbf;">Dataset</font><br>
</div>

In [2]:
import pandas as pd
PATH = "COVID-19-master/csse_covid_19_data/csse_covid_19_daily_reports/"
covid = pd.read_csv(PATH + "04-01-2020.csv", encoding='utf-8-sig')

In [5]:
covid.head()

Unnamed: 0,FIPS,Admin2,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Deaths,Recovered,Active,Combined_Key
0,45001.0,Abbeville,South Carolina,US,2020-04-01 21:58:49,34.223334,-82.461707,4,0,0,0,"Abbeville, South Carolina, US"
1,22001.0,Acadia,Louisiana,US,2020-04-01 21:58:49,30.295065,-92.414197,47,1,0,0,"Acadia, Louisiana, US"
2,51001.0,Accomack,Virginia,US,2020-04-01 21:58:49,37.767072,-75.632346,7,0,0,0,"Accomack, Virginia, US"
3,16001.0,Ada,Idaho,US,2020-04-01 21:58:49,43.452658,-116.241552,195,3,0,0,"Ada, Idaho, US"
4,19001.0,Adair,Iowa,US,2020-04-01 21:58:49,41.330756,-94.471059,1,0,0,0,"Adair, Iowa, US"


<div class="alert alert-block" style="border: 1px solid #FFB300;background-color:#F9FBE7;">
<font size="4em" style="font-weight:bold;color:#3f8dbf;">5. Group Dataframe</font><br>
</div>

- groupby(column): group the rows of a DataFrame by the values in one or more columns
- groupby(column).sum(): add up the values within each group

In [3]:
df_test = pd.DataFrame({
    'Gender': ['M', 'M', 'F', 'F'],
    'Name': ['Ben', 'Will', 'Danielle', 'Melissa'],
    'Math': [85, 80, 90, 65],
    'Literature': [65, 80, 78, 95]    
})
df_test

Unnamed: 0,Gender,Name,Math,Literature
0,M,Ben,85,65
1,M,Will,80,80
2,F,Danielle,90,78
3,F,Melissa,65,95


In [4]:
df_test.groupby('Gender').mean(numeric_only=True)  # average numeric columns only; 'Name' is text

Gender,Math,Literature
F,77.5,86.5
M,82.5,72.5
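
The `groupby(column).sum()` form listed above works the same way as `mean()`. The sketch below reuses the `df_test` frame from this section; selecting the two score columns explicitly and the `as_index=False` variant are illustrative choices, not steps from the original notebook.

```python
import pandas as pd

df_test = pd.DataFrame({
    'Gender': ['M', 'M', 'F', 'F'],
    'Name': ['Ben', 'Will', 'Danielle', 'Melissa'],
    'Math': [85, 80, 90, 65],
    'Literature': [65, 80, 78, 95],
})

# Pick the numeric columns explicitly so the text column 'Name' is not aggregated.
totals = df_test.groupby('Gender')[['Math', 'Literature']].sum()
print(totals)

# as_index=False keeps 'Gender' as an ordinary column instead of the index.
totals_flat = df_test.groupby('Gender', as_index=False)[['Math', 'Literature']].sum()
print(totals_flat)
```

With `as_index=False` the result keeps a default integer index, which can be convenient when the grouped column is needed later for filtering or merging.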


### COVID-19 Data Example
>- After grouping, the grouping column (Country_Region) becomes the index of the result

In [3]:
covid = covid.groupby('Country_Region').sum(numeric_only=True)  # add up numeric columns only
covid.head()

Country_Region,FIPS,Lat,Long_,Confirmed,Deaths,Recovered,Active
Afghanistan,0.0,33.93911,67.709953,237,4,5,228
Albania,0.0,41.1533,20.1683,259,15,67,177
Algeria,0.0,28.0339,1.6596,847,58,61,728
Andorra,0.0,42.5063,1.5218,390,14,10,366
Angola,0.0,-11.2027,17.8739,8,2,1,5


In [5]:
covid.index.name

'Country_Region'

>- Select a subset of the grouped DataFrame by index value

In [4]:
covid[covid.index=='US']

Country_Region,FIPS,Lat,Long_,Confirmed,Deaths,Recovered,Active
US,65168934.0,82956.96013,-197553.963757,213372,4757,8474,0
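
Because the country name is now the index, the same row can also be selected by label with `.loc`, or the index can be turned back into a column with `reset_index()`. The following is a minimal sketch on a small made-up frame (the `sample` values are illustrative, not taken from the daily report).

```python
import pandas as pd

# Made-up rows standing in for the daily report; the numbers are illustrative only.
sample = pd.DataFrame({
    'Country_Region': ['US', 'US', 'Italy'],
    'Confirmed': [4, 47, 10],
    'Deaths': [0, 1, 2],
})
by_country = sample.groupby('Country_Region').sum(numeric_only=True)

# Label-based selection; passing a list keeps the result a DataFrame.
us_row = by_country.loc[['US']]
print(us_row)

# reset_index() moves 'Country_Region' from the index back into a column.
flat = by_country.reset_index()
us_flat = flat[flat['Country_Region'] == 'US']
print(us_flat)
```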


<div class="alert alert-block" style="border: 1px solid #FFB300;background-color:#F9FBE7;">
<font size="4em" style="font-weight:bold;color:#3f8dbf;">6. Change Datatype in Dataframe</font><br>
</div>


> - object: string
> - int64: integer
> - float64: float
> - bool: boolean

In [8]:
import pandas as pd
PATH = "COVID-19-master/csse_covid_19_data/csse_covid_19_daily_reports/"
covid_Jan22 = pd.read_csv(PATH + "01-22-2020.csv", encoding='utf-8-sig')
covid_Jan22.head()

Unnamed: 0,Province/State,Country/Region,Last Update,Confirmed,Deaths,Recovered
0,Anhui,Mainland China,1/22/2020 17:00,1.0,,
1,Beijing,Mainland China,1/22/2020 17:00,14.0,,
2,Chongqing,Mainland China,1/22/2020 17:00,6.0,,
3,Fujian,Mainland China,1/22/2020 17:00,1.0,,
4,Gansu,Mainland China,1/22/2020 17:00,,,


In [9]:
covid_Jan22 = covid_Jan22[['Country/Region', 'Confirmed']]
covid_Jan22.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38 entries, 0 to 37
Data columns (total 2 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Country/Region  38 non-null     object 
 1   Confirmed       29 non-null     float64
dtypes: float64(1), object(1)
memory usage: 736.0+ bytes


> - Series.astype('datatype')
> - Dataframe.astype({'column': 'datatype'})
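
The two forms listed above can be sketched on a small frame. The `scores` frame and its values are made up for illustration.

```python
import pandas as pd

scores = pd.DataFrame({'Math': [85, 80], 'Literature': [65.0, 80.0]})

# Series.astype: convert a single column and assign the result back.
scores['Literature'] = scores['Literature'].astype('int64')

# DataFrame.astype with a dict: convert the named columns, leave the others as they are.
scores = scores.astype({'Math': 'float64'})
print(scores.dtypes)
```

Both calls return a new object rather than modifying the data in place, so the result has to be assigned back.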

In [10]:
covid_Jan22 = covid_Jan22.astype({'Confirmed': 'int64'})
covid_Jan22.info()

IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer

>- The cast fails because the Confirmed column contains missing values (NaN), which int64 cannot represent
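
Besides dropping the affected rows, two other common ways around this error are pandas' nullable integer dtype `Int64` (capital "I") and filling the gaps before casting. The sketch below uses a made-up `confirmed` Series; whether 0 is a sensible fill value depends on the data, so treat that part as an assumption.

```python
import numpy as np
import pandas as pd

confirmed = pd.Series([1.0, 14.0, np.nan], name='Confirmed')

# Option 1: nullable integer dtype; the missing value is stored as <NA>.
as_nullable = confirmed.astype('Int64')
print(as_nullable)

# Option 2: replace missing values first, then cast to plain int64.
# Filling with 0 is an assumption; it changes what the column means.
as_filled = confirmed.fillna(0).astype('int64')
print(as_filled)
```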

In [11]:
covid_Jan22 = covid_Jan22.dropna(subset=['Confirmed'])
covid_Jan22 = covid_Jan22.astype({'Confirmed': 'int64'})
covid_Jan22.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 29 entries, 0 to 37
Data columns (total 2 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Country/Region  29 non-null     object
 1   Confirmed       29 non-null     int64 
dtypes: int64(1), object(1)
memory usage: 696.0+ bytes


In [12]:
covid_Jan22.head()

Unnamed: 0,Country/Region,Confirmed
0,Mainland China,1
1,Mainland China,14
2,Mainland China,6
3,Mainland China,1
5,Mainland China,26


<div class="alert alert-block" style="border: 1px solid #FFB300;background-color:#F9FBE7;">
<font size="4em" style="font-weight:bold;color:#3f8dbf;">7. Change Columns in Dataframe</font><br>
</div>

In [13]:
import pandas as pd
PATH = "COVID-19-master/csse_covid_19_data/csse_covid_19_daily_reports/"
covid_Jan22 = pd.read_csv(PATH + "01-22-2020.csv", encoding='utf-8-sig')
covid_Jan22 = covid_Jan22[['Country/Region', 'Confirmed']]

In [14]:
covid_Jan22.columns

Index(['Country/Region', 'Confirmed'], dtype='object')

In [15]:
covid_Jan22.columns = ['Country_Region', 'Confirmed']

In [16]:
covid_Jan22.columns

Index(['Country_Region', 'Confirmed'], dtype='object')
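
Assigning a full list to `columns` requires naming every column in order. When only some names change, `DataFrame.rename` with a mapping is a possible alternative; the one-row frame below is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Country/Region': ['Mainland China'], 'Confirmed': [1.0]})

# rename() returns a new DataFrame; columns missing from the mapping keep their names.
df = df.rename(columns={'Country/Region': 'Country_Region'})
print(df.columns)
```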

<div class="alert alert-block" style="border: 1px solid #FFB300;background-color:#F9FBE7;">
<font size="4em" style="font-weight:bold;color:#3f8dbf;">8. Dealing with Duplicate Values</font><br>
</div>

In [13]:
doc = pd.read_csv("COVID-19-master/csse_covid_19_data/UID_ISO_FIPS_LookUp_Table.csv", encoding='utf-8-sig')
doc.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,UID,iso2,iso3,code3,FIPS,Admin2,Province_State,Country_Region,Lat,Long_,Combined_Key
0,0,0,,BW,,,,,,Botswana,,,Botswana
1,1,1,,BI,,,,,,Burundi,,,Burundi
2,2,2,,SL,,,,,,Sierra Leone,,,Sierra Leone
3,3,3,4.0,AF,AFG,4.0,,,,Afghanistan,33.93911,67.709953,Afghanistan
4,4,4,8.0,AL,ALB,8.0,,,,Albania,41.1533,20.1683,Albania


- Create Subset of Dataframe

In [14]:
doc = doc[['iso2', 'Country_Region']]
doc

Unnamed: 0,iso2,Country_Region
0,BW,Botswana
1,BI,Burundi
2,SL,Sierra Leone
3,AF,Afghanistan
4,AL,Albania
...,...,...
3555,US,US
3556,US,US
3557,US,US
3558,US,US


>- df.duplicated()
    - Return boolean Series denoting duplicate rows.

>reference: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.duplicated.html
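
By default `duplicated()` compares entire rows and marks every repeat after the first occurrence. The `subset` and `keep` parameters change what counts as a duplicate. A minimal sketch on a made-up `codes` frame:

```python
import pandas as pd

# Made-up rows for illustration.
codes = pd.DataFrame({
    'iso2': ['US', 'US', 'GB', 'TC'],
    'Country_Region': ['US', 'US', 'United Kingdom', 'United Kingdom'],
})

# Whole-row comparison: only the second 'US' row repeats an earlier row.
full_dup = codes.duplicated()

# Compare on one column only: the second 'United Kingdom' row is now flagged too.
by_country = codes.duplicated(subset='Country_Region')

# keep=False flags every member of a duplicate group, including the first.
all_copies = codes.duplicated(subset='Country_Region', keep=False)

# True counts as 1, so sum() gives the number of flagged rows.
n_full = int(full_dup.sum())
print(n_full)
```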

In [20]:
doc.duplicated()

0       False
1       False
2       False
3       False
4       False
        ...  
3555     True
3556     True
3557     True
3558     True
3559     True
Length: 3560, dtype: bool

### Remove Duplicated Rows
>- df.drop_duplicates(subset='column', keep='first')
    - keep='first' (default): keep the first occurrence of each duplicate and remove the rest
    - keep='last': keep the last occurrence and remove the rest

In [23]:
doc = doc.drop_duplicates(subset='Country_Region', keep='last')
doc

Unnamed: 0,iso2,Country_Region
0,BW,Botswana
1,BI,Burundi
2,SL,Sierra Leone
3,AF,Afghanistan
4,AL,Albania
...,...,...
198,TC,United Kingdom
206,AU,Australia
221,CA,Canada
254,CN,China
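
In the output above, United Kingdom is paired with the code TC because `keep='last'` retains the last row found for each `Country_Region`. The difference between `keep='first'` and `keep='last'` can be sketched on a small made-up `lookup` frame:

```python
import pandas as pd

# Made-up rows for illustration.
lookup = pd.DataFrame({
    'iso2': ['GB', 'TC', 'US', 'US'],
    'Country_Region': ['United Kingdom', 'United Kingdom', 'US', 'US'],
})

# keep='first' retains rows 0 and 2; keep='last' retains rows 1 and 3.
first = lookup.drop_duplicates(subset='Country_Region', keep='first')
last = lookup.drop_duplicates(subset='Country_Region', keep='last')
print(first)
print(last)

# The original index labels survive; reset_index(drop=True) renumbers from 0.
first_reset = first.reset_index(drop=True)
print(first_reset)
```

If one row per country with a specific code is needed, sorting or filtering before `drop_duplicates` gives more control than relying on row order.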
