## Building a Graph from Raw Data Processed with Python

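### [ 1. Data Visualization ]
- Data visualization is the process of presenting analysis results visually so that they are easy to understand and communicate.
- It is essential for making results easy to read at a glance.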
- https://app.flourish.studio

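### [ 2. Understanding the Data Format for Visualization ]
- To visualize the data, the raw data must first be transformed.
- Required data: country name, flag, and number of confirmed cases per date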

<img src="https://www.fun-coding.org/00_Images/covid_ex_data_format.jpg" />

### [ 3. Loading the Raw Data ]

#### As the raw data below shows, the column that holds the region is named Country/Region in the earlier files (for example 03-01-2020.csv) but Country_Region in the later files (for example 04-01-2020.csv).

In [1]:
import pandas as pd
path = 'COVID-19-master/csse_covid_19_data/csse_covid_19_daily_reports/'
doc = pd.read_csv(path + '04-01-2020.csv', encoding = 'utf-8-sig')
doc.head()

Unnamed: 0,FIPS,Admin2,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Deaths,Recovered,Active,Combined_Key
0,45001.0,Abbeville,South Carolina,US,2020-04-01 21:58:49,34.223334,-82.461707,4,0,0,0,"Abbeville, South Carolina, US"
1,22001.0,Acadia,Louisiana,US,2020-04-01 21:58:49,30.295065,-92.414197,47,1,0,0,"Acadia, Louisiana, US"
2,51001.0,Accomack,Virginia,US,2020-04-01 21:58:49,37.767072,-75.632346,7,0,0,0,"Accomack, Virginia, US"
3,16001.0,Ada,Idaho,US,2020-04-01 21:58:49,43.452658,-116.241552,195,3,0,0,"Ada, Idaho, US"
4,19001.0,Adair,Iowa,US,2020-04-01 21:58:49,41.330756,-94.471059,1,0,0,0,"Adair, Iowa, US"


In [15]:
import pandas as pd
path = 'COVID-19-master/csse_covid_19_data/csse_covid_19_daily_reports/'
doc = pd.read_csv(path + '03-01-2020.csv', encoding = 'utf-8-sig')
doc.head()

Unnamed: 0,Province/State,Country/Region,Last Update,Confirmed,Deaths,Recovered,Latitude,Longitude
0,Hubei,Mainland China,2020-03-01T10:13:19,66907,2761,31536,30.9756,112.2707
1,,South Korea,2020-03-01T23:43:03,3736,17,30,36.0,128.0
2,,Italy,2020-03-01T23:23:02,1694,34,83,43.0,12.0
3,Guangdong,Mainland China,2020-03-01T14:13:18,1349,7,1016,23.3417,113.4244
4,Henan,Mainland China,2020-03-01T14:13:18,1272,22,1198,33.882,113.614


#### A try/except block is therefore used to give these columns the same name in every file.

In [11]:
doc = pd.read_csv(path + '01-22-2020.csv', encoding = 'utf-8-sig')
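try:
    doc = doc[['Province_State', 'Country_Region', 'Confirmed']]
except KeyError:
    # Older files use the slash-style names, so select those and rename them.
    doc = doc[['Province/State', 'Country/Region', 'Confirmed']]
    doc.columns = ['Province_State', 'Country_Region', 'Confirmed']

# When the csv files are read in a loop with this block,
# Province/State style column names are all renamed to Province_State style.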
doc.head()

Unnamed: 0,Province_State,Country_Region,Confirmed
0,Anhui,Mainland China,1.0
1,Beijing,Mainland China,14.0
2,Chongqing,Mainland China,6.0
3,Fujian,Mainland China,1.0
4,Gansu,Mainland China,
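
As an alternative to try/except, the same normalization can be sketched with `DataFrame.rename`, which ignores mapping keys that are not present. The name `COLUMN_ALIASES`, the helper `normalize_columns`, and the two small in-memory frames are assumptions made for illustration; they are not part of the original notebook.

```python
import pandas as pd

# Hypothetical mapping from the older header style to the newer one.
COLUMN_ALIASES = {
    'Province/State': 'Province_State',
    'Country/Region': 'Country_Region',
}

def normalize_columns(frame):
    """Return a copy whose old-style headers are renamed; absent keys are ignored."""
    return frame.rename(columns=COLUMN_ALIASES)

# Illustrative frames standing in for an old-style and a new-style file.
old_style = pd.DataFrame({'Province/State': ['Anhui'],
                          'Country/Region': ['Mainland China'],
                          'Confirmed': [1]})
new_style = pd.DataFrame({'Province_State': ['South Carolina'],
                          'Country_Region': ['US'],
                          'Confirmed': [4]})

wanted = ['Province_State', 'Country_Region', 'Confirmed']
print(normalize_columns(old_style)[wanted])
print(normalize_columns(new_style)[wanted])
```

With this approach, a file that has neither header style fails at the column selection with a `KeyError`, rather than being hidden by a broad `except`.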


### [ 4. Transforming the DataFrame ]
- STEP 1. Build a DataFrame from selected columns only
- STEP 2. Drop rows with missing data (NaN) in a given column
- STEP 3. Change the data type of a given column

In [12]:
doc = pd.read_csv(path + '01-22-2020.csv', encoding = 'utf-8-sig')
try:
    # STEP 1. Build a DataFrame from selected columns only
    doc = doc[['Province_State', 'Country_Region', 'Confirmed']]
except KeyError:
    # STEP 1. Same selection for files that use the older column names
    doc = doc[['Province/State', 'Country/Region', 'Confirmed']]
    doc.columns = ['Province_State', 'Country_Region', 'Confirmed']
# STEP 2. Drop rows with missing data (NaN) in the Confirmed column
doc = doc.dropna(subset = ['Confirmed'])
# STEP 3. Change the data type of the Confirmed column
doc = doc.astype({'Confirmed' : 'int64'})
doc.head()

Unnamed: 0,Province_State,Country_Region,Confirmed
0,Anhui,Mainland China,1
1,Beijing,Mainland China,14
2,Chongqing,Mainland China,6
3,Fujian,Mainland China,1
5,Guangdong,Mainland China,26


- Load the country information.
- In the table below, the iso2 column holds the two-letter code for each country.
- This code can be used to fetch a flag image from an external source.

In [13]:
country_info = pd.read_csv("COVID-19-master/csse_covid_19_data/UID_ISO_FIPS_LookUp_Table.csv", encoding='utf-8-sig')
country_info.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,UID,iso2,iso3,code3,FIPS,Admin2,Province_State,Country_Region,Lat,Long_,Combined_Key
0,0,0,,BW,,,,,,Botswana,,,Botswana
1,1,1,,BI,,,,,,Burundi,,,Burundi
2,2,2,,SL,,,,,,Sierra Leone,,,Sierra Leone
3,3,3,4.0,AF,AFG,4.0,,,,Afghanistan,33.93911,67.709953,Afghanistan
4,4,4,8.0,AL,ALB,8.0,,,,Albania,41.1533,20.1683,Albania
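
A minimal sketch of turning an iso2 code into a flag image URL follows. The URL template is purely hypothetical (`example.com` is a placeholder), as is the helper name `flag_url`; substitute whichever image host is actually used.

```python
import pandas as pd

# Hypothetical URL template; replace it with the image host you actually use.
FLAG_URL_TEMPLATE = 'https://example.com/flags/{code}.png'

def flag_url(iso2):
    """Build a flag image URL from an iso2 code, or None when the code is missing."""
    if pd.isna(iso2):
        return None
    return FLAG_URL_TEMPLATE.format(code=str(iso2).lower())

# Illustrative rows; the last one has no iso2 code.
sample = pd.DataFrame({'iso2': ['AF', 'AL', None],
                       'Country_Region': ['Afghanistan', 'Albania', 'Unknown']})
sample['flag'] = sample['iso2'].apply(flag_url)
print(sample)
```

One thing to watch for: `pd.read_csv` treats the string `NA` as a missing value by default, so a country whose iso2 code is `NA` may come through as NaN unless `keep_default_na=False` is passed.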


- Merge the two DataFrames.
- The data of interest is doc, so how = 'left' is used to make doc the base of the join.
- After merging, info() shows the non-null count of each column, which reveals the rows where iso2 is NaN.

In [15]:
test_df = pd.merge(doc, country_info, how = 'left', on = 'Country_Region')
test_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3333 entries, 0 to 3332
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Province_State_x  3330 non-null   object 
 1   Country_Region    3333 non-null   object 
 2   Confirmed         3333 non-null   int64  
 3   Unnamed: 0        3308 non-null   float64
 4   Unnamed: 0.1      3308 non-null   float64
 5   UID               3308 non-null   float64
 6   iso2              3308 non-null   object 
 7   iso3              3308 non-null   object 
 8   code3             3308 non-null   float64
 9   FIPS              3302 non-null   float64
 10  Admin2            3246 non-null   object 
 11  Province_State_y  3305 non-null   object 
 12  Lat               3203 non-null   float64
 13  Long_             3203 non-null   float64
 14  Combined_Key      3308 non-null   object 
dtypes: float64(7), int64(1), object(7)
memory usage: 416.6+ KB


- Check the NaN data.
- For example, Mainland China is entered as China in country_info, so Mainland China must be changed to China.
- Likewise, South Korea is entered as Korea, South in country_info, so it also needs to be changed.

In [17]:
test_df.isnull().sum()

Province_State_x      3
Country_Region        0
Confirmed             0
Unnamed: 0           25
Unnamed: 0.1         25
UID                  25
iso2                 25
iso3                 25
code3                25
FIPS                 31
Admin2               87
Province_State_y     28
Lat                 130
Long_               130
Combined_Key         25
dtype: int64

In [22]:
nan_rows = test_df[test_df['iso2'].isnull()]
nan_rows

Unnamed: 0.2,Province_State_x,Country_Region,Confirmed,Unnamed: 0,Unnamed: 0.1,UID,iso2,iso3,code3,FIPS,Admin2,Province_State_y,Lat,Long_,Combined_Key
0,Anhui,Mainland China,1,,,,,,,,,,,,
1,Beijing,Mainland China,14,,,,,,,,,,,,
2,Chongqing,Mainland China,6,,,,,,,,,,,,
3,Fujian,Mainland China,1,,,,,,,,,,,,
4,Guangdong,Mainland China,26,,,,,,,,,,,,
5,Guangxi,Mainland China,2,,,,,,,,,,,,
6,Guizhou,Mainland China,1,,,,,,,,,,,,
7,Hainan,Mainland China,4,,,,,,,,,,,,
8,Hebei,Mainland China,1,,,,,,,,,,,,
9,Henan,Mainland China,5,,,,,,,,,,,,
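
Another way to list the country names that found no match is the `indicator=True` option of `pd.merge`, which adds a `_merge` column. The sketch below uses small illustrative frames (`reports`, `lookup`) in place of the real files.

```python
import pandas as pd

# Illustrative stand-ins for doc and country_info.
reports = pd.DataFrame({'Country_Region': ['Mainland China', 'South Korea', 'Japan'],
                        'Confirmed': [547, 1, 2]})
lookup = pd.DataFrame({'Country_Region': ['China', 'Korea, South', 'Japan'],
                       'iso2': ['CN', 'KR', 'JP']})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'.
merged = pd.merge(reports, lookup, how='left', on='Country_Region', indicator=True)

# Names that exist only on the left side had no match in the lookup table.
unmatched = merged.loc[merged['_merge'] == 'left_only', 'Country_Region'].unique()
print(sorted(unmatched))
```

The resulting list is a convenient starting point for deciding which names need a conversion rule.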


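### [ 5. Changing Column Values ]
- Country names in Country_Region are written in many inconsistent ways.
- No single key can convert every case at once: the candidate key columns vary and change from file to file, so matching on a key was not possible.
- Each case was therefore checked by hand, and a separate JSON file was written so that country names can be converted consistently.
- Country names are converted based on this JSON file.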

In [24]:
import json

with open('COVID-19-master/csse_covid_19_data/country_convert.json', 'r', encoding = 'utf-8-sig') as json_file:
    json_data = json.load(json_file)
    print(json_data)

{'Mainland China': 'China', 'Macau': 'China', 'South Korea': 'Korea, South', 'Aruba': 'Netherlands', ' Azerbaijan': 'Azerbaijan', 'Bahamas, The': 'Bahamas', 'Cape Verde': 'Cabo Verde', 'Cayman Islands': 'United Kingdom', 'Channel Islands': 'United Kingdom', 'Curacao': 'Netherlands', 'Czech Republic': 'Czechia', 'East Timor': 'Timor-Leste', 'Faroe Islands': 'Denmark', 'French Guiana': 'France', 'Gambia, The': 'Gambia', 'Gibraltar': 'United Kingdom', 'Greenland': 'Denmark', 'Guadeloupe': 'France', 'Guam': 'US', 'Guernsey': 'US', 'Hong Kong': 'China', 'Hong Kong SAR': 'China', 'Iran (Islamic Republic of)': 'Iran', 'Ivory Coast': "Cote d'Ivoire", 'Jersey': 'US', 'Macao SAR': 'China', 'Martinique': 'France', 'Mayotte': 'France', 'North Ireland': 'United Kingdom', 'Palestine': 'West Bank and Gaza', 'Puerto Rico': 'US', 'Republic of Ireland': 'Ireland', 'Republic of Korea': 'Korea, South', 'Republic of Moldova': 'Moldova', 'Republic of the Congo': 'Congo (Brazzaville)', 'Reunion': 'France', '

### [ 6. How to Use apply() ]
#### apply() can be used to change the values of a specific column

In [26]:
df = pd.DataFrame({
    '영어' : [70, 90],
    '수학' : [100, 50],
}, index = ['Dave', 'David'])

df

Unnamed: 0,영어,수학
Dave,70,100
David,90,50


In [27]:
def func(df_data):
    print(type(df_data))
    print(df_data.index)
    print(df_data.values)
    return df_data

#### When the axis option of apply is 0, each column of df is passed to the function as a Series indexed by the row labels.

In [28]:
df_func = df.apply(func, axis = 0)

<class 'pandas.core.series.Series'>
Index(['Dave', 'David'], dtype='object')
[70 90]
<class 'pandas.core.series.Series'>
Index(['Dave', 'David'], dtype='object')
[70 90]
<class 'pandas.core.series.Series'>
Index(['Dave', 'David'], dtype='object')
[100  50]


#### When the axis option of apply is 1, each row of df is passed to the function as a Series indexed by the column names.

In [29]:
df_func=df.apply(func, axis = 1)

<class 'pandas.core.series.Series'>
Index(['영어', '수학'], dtype='object')
[ 70 100]
<class 'pandas.core.series.Series'>
Index(['영어', '수학'], dtype='object')
[ 70 100]
<class 'pandas.core.series.Series'>
Index(['영어', '수학'], dtype='object')
[90 50]


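- Note: in the pandas version used here, apply() calls the function twice on the first row or column, which is why func is called three times in total above. Newer pandas releases may call it only once per row or column.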
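
A quick way to confirm which axis is handed to the function is to return the `name` of each Series that apply() passes in. This is a small sketch on the same two-by-two table; the variable names are chosen here for illustration.

```python
import pandas as pd

df_demo = pd.DataFrame({
    '영어': [70, 90],
    '수학': [100, 50],
}, index=['Dave', 'David'])

# s.name is the label of the Series that apply() passes to the function.
names_axis0 = df_demo.apply(lambda s: s.name, axis=0)
names_axis1 = df_demo.apply(lambda s: s.name, axis=1)

print(list(names_axis0))  # column labels: axis=0 passes one column at a time
print(list(names_axis1))  # row labels: axis=1 passes one row at a time
```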

#### Example

In [31]:
df = pd.DataFrame({
    '영어' : [70, 90],
    '수학' : [100, 50],
}, index = ['Dave', 'David'])

df

Unnamed: 0,영어,수학
Dave,70,100
David,90,50


In [36]:
def df_func1(df_data):
    df_data['영어'] = 80
    return df_data

In [39]:
df = df.apply(df_func1, axis = 1)
df

Unnamed: 0,영어,수학
Dave,80,100
David,80,50


In [40]:
def df_func2(df_data):
    df_data['Dave'] = 100
    return df_data

In [42]:
df = df.apply(df_func2, axis = 0)
df

Unnamed: 0,영어,수학
Dave,100,100
David,80,50


### [ 7. Changing the Country Column with apply() ]

In [43]:
import pandas as pd
doc = pd.read_csv(path + '01-22-2020.csv', encoding = 'utf-8-sig')
try:
    doc = doc[['Province_State', 'Country_Region', 'Confirmed']]
except KeyError:
    doc = doc[['Province/State', 'Country/Region', 'Confirmed']]
    doc.columns = ['Province_State', 'Country_Region', 'Confirmed']
doc = doc.dropna(subset = ['Confirmed'])
doc = doc.astype({'Confirmed' : 'int64'})
doc.head()

Unnamed: 0,Province_State,Country_Region,Confirmed
0,Anhui,Mainland China,1
1,Beijing,Mainland China,14
2,Chongqing,Mainland China,6
3,Fujian,Mainland China,1
5,Guangdong,Mainland China,26


- Read the JSON file that holds the country names to convert.

In [55]:
import json
with open('COVID-19-master/csse_covid_19_data/country_convert.json', 'r', encoding = 'utf-8-sig') as json_file:
    json_data = json.load(json_file)
    print(json_data.items())

dict_items([('Mainland China', 'China'), ('Macau', 'China'), ('South Korea', 'Korea, South'), ('Aruba', 'Netherlands'), (' Azerbaijan', 'Azerbaijan'), ('Bahamas, The', 'Bahamas'), ('Cape Verde', 'Cabo Verde'), ('Cayman Islands', 'United Kingdom'), ('Channel Islands', 'United Kingdom'), ('Curacao', 'Netherlands'), ('Czech Republic', 'Czechia'), ('East Timor', 'Timor-Leste'), ('Faroe Islands', 'Denmark'), ('French Guiana', 'France'), ('Gambia, The', 'Gambia'), ('Gibraltar', 'United Kingdom'), ('Greenland', 'Denmark'), ('Guadeloupe', 'France'), ('Guam', 'US'), ('Guernsey', 'US'), ('Hong Kong', 'China'), ('Hong Kong SAR', 'China'), ('Iran (Islamic Republic of)', 'Iran'), ('Ivory Coast', "Cote d'Ivoire"), ('Jersey', 'US'), ('Macao SAR', 'China'), ('Martinique', 'France'), ('Mayotte', 'France'), ('North Ireland', 'United Kingdom'), ('Palestine', 'West Bank and Gaza'), ('Puerto Rico', 'US'), ('Republic of Ireland', 'Ireland'), ('Republic of Korea', 'Korea, South'), ('Republic of Moldova', 'Mo

- Check the value in the Country_Region column and replace it with the designated country name only when it is written differently.

In [49]:
def func(data):
    # e.g. 'Mainland China' is a key in json_data, so this test is True for that row
    if data['Country_Region'] in json_data:
        # json_data['Mainland China'] is 'China', so that value is assigned to
        # data['Country_Region']; 'Mainland China' in doc therefore becomes 'China'.
        data['Country_Region'] = json_data[data['Country_Region']]
    return data

In [53]:
doc = doc.apply(func ,axis = 1)
doc.head()

Unnamed: 0,Province_State,Country_Region,Confirmed
0,Anhui,China,1
1,Beijing,China,14
2,Chongqing,China,6
3,Fujian,China,1
5,Guangdong,China,26
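
The same conversion can also be sketched without a row-wise apply(), using `Series.replace` with the mapping dictionary. The two-entry `name_map` below is an excerpt of the mapping printed above, and the sample frame is illustrative.

```python
import pandas as pd

# Excerpt of the country-name mapping; the full mapping lives in the JSON file.
name_map = {'Mainland China': 'China', 'South Korea': 'Korea, South'}

sample = pd.DataFrame({'Country_Region': ['Mainland China', 'South Korea', 'Japan'],
                       'Confirmed': [547, 1, 2]})

# Values that are keys of name_map are replaced; all other values are left unchanged.
sample['Country_Region'] = sample['Country_Region'].replace(name_map)
print(sample)
```

This works on the whole column at once, which is usually shorter to write and tends to be faster than calling a Python function per row.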


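### [ Note: Deriving the Column Name from the File Name ]

- lstrip() : removes the given characters from the left end of a string
- rstrip() : removes the given characters from the right end of a string
- replace(old, new) : replaces every occurrence of old in the string with new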

In [62]:
data = '01-22-2020.csv'
data_column = data.split('.')[0].lstrip('0').replace( '-' , '/' )
data_column

'1/22/2020'
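
The conversion above can be wrapped in a small helper, sketched here under the hypothetical name `filename_to_column`. The last line illustrates that lstrip() and rstrip() remove any of the given characters rather than a fixed prefix or suffix, which is why the extension is removed with split() instead.

```python
def filename_to_column(filename):
    """Turn 'MM-DD-YYYY.csv' into a date label such as '1/22/2020'."""
    stem = filename.split('.')[0]          # drop the extension
    return stem.lstrip('0').replace('-', '/')

print(filename_to_column('01-22-2020.csv'))  # 1/22/2020
print(filename_to_column('04-01-2020.csv'))  # 4/01/2020 (only the leading zero is removed)

# rstrip('.csv') strips the characters '.', 'c', 's', 'v', not the suffix '.csv'.
print('docs.csv'.rstrip('.csv'))             # do
```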

In [66]:
doc.head()

Unnamed: 0,Province_State,Country_Region,Confirmed
0,Anhui,China,1
1,Beijing,China,14
2,Chongqing,China,6
3,Fujian,China,1
5,Guangdong,China,26


In [67]:
doc.columns

Index(['Province_State', 'Country_Region', 'Confirmed'], dtype='object')

In [69]:
doc.columns = ['Province_State', 'Country_Region', data_column]
doc.columns

Index(['Province_State', 'Country_Region', '1/22/2020'], dtype='object')

In [72]:
doc.head()

Unnamed: 0,Province_State,Country_Region,1/22/2020
0,Anhui,China,1
1,Beijing,China,14
2,Chongqing,China,6
3,Fujian,China,1
5,Guangdong,China,26


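### [ 8. Combining Duplicate Rows ]

- groupby() : aggregates data per group
- In the pandas version used here, columns that hold string values are dropped from the result of mean() or sum(). Newer pandas versions may instead raise an error for mean() or concatenate the strings for sum().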

In [77]:
df = pd.DataFrame({
    '성별' : ['남', '남', '남'],
    '이름' : ['David', 'Dave', 'Dave'],
    '수학' : [100, 50, 80],
    '국어' : [80, 70, 50]
})
df

Unnamed: 0,성별,이름,수학,국어
0,남,David,100,80
1,남,Dave,50,70
2,남,Dave,80,50


In [78]:
df.groupby('이름').mean()

Unnamed: 0_level_0,수학,국어
이름,Unnamed: 1_level_1,Unnamed: 2_level_1
Dave,65,60
David,100,80


In [80]:
df.groupby('이름').sum()

Unnamed: 0_level_0,수학,국어
이름,Unnamed: 1_level_1,Unnamed: 2_level_1
Dave,130,120
David,100,80
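
To avoid depending on how a given pandas version treats string columns, the numeric columns can be selected explicitly before aggregating. This is a sketch on the same table; `numeric_cols` is a name chosen here for illustration.

```python
import pandas as pd

scores = pd.DataFrame({
    '성별': ['남', '남', '남'],
    '이름': ['David', 'Dave', 'Dave'],
    '수학': [100, 50, 80],
    '국어': [80, 70, 50],
})

# Select the numeric columns first so that string columns never reach mean()/sum().
numeric_cols = ['수학', '국어']
mean_by_name = scores.groupby('이름')[numeric_cols].mean()
sum_by_name = scores.groupby('이름')[numeric_cols].sum()

print(mean_by_name)
print(sum_by_name)
```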


#### Computing the total number of confirmed cases per country

In [88]:
import pandas as pd

doc = pd.read_csv(path + '01-22-2020.csv', encoding = 'utf-8-sig')

try:
    doc = doc[['Province_State', 'Country_Region', 'Confirmed']]
except KeyError:
    doc = doc[['Province/State', 'Country/Region', 'Confirmed']]
    doc.columns = ['Province_State', 'Country_Region', 'Confirmed']
doc = doc.dropna(subset = ['Confirmed'])
doc = doc.astype({'Confirmed' : 'int64'})

doc = doc.groupby('Country_Region').sum()
doc

Unnamed: 0_level_0,Confirmed
Country_Region,Unnamed: 1_level_1
Japan,2
Macau,1
Mainland China,547
South Korea,1
Taiwan,1
Thailand,2
US,1


### [ 9. Preprocessing the Data ]

- Wrap the steps above into functions:
1. Extract only the files that are needed
2. Sort the file list
3. Preprocess each DataFrame
4. Merge the DataFrames

In [63]:
# Load the JSON file used to unify the names in Country_Region
import json

with open('COVID-19-master/csse_covid_19_data/country_convert.json', 'r', encoding = 'utf-8-sig') as json_file:
    json_data = json.load(json_file)

# Unify the values in Country_Region
def country_name_convert(row):
    if row['Country_Region'] in json_data:
        return json_data[row['Country_Region']]
    return row['Country_Region']
# def country_name_convert(row):
#     if row['Country_Region'] in json_data:
#         row['Country_Region'] = json_data[row['Country_Region']]
#     return row

# Build a preprocessed DataFrame from one file, ready to be merged
def creat_dataframe(filename):
    doc = pd.read_csv(path + filename, encoding = 'utf-8-sig')
    
    try:
        doc = doc[['Country_Region', 'Confirmed']]
    except KeyError:
        doc = doc[['Country/Region', 'Confirmed']]
        doc.columns = ['Country_Region', 'Confirmed']
    
    
    doc = doc.dropna(subset = ['Confirmed'])
    doc = doc.astype({'Confirmed' : 'int64'})
    doc['Country_Region'] = doc.apply(country_name_convert, axis = 1)
  # doc = doc.apply(country_name_convert, axis = 1)
    doc = doc.groupby('Country_Region').sum()
    
    
    column_date = filename.split('.')[0].lstrip('0').replace('-','/')
    # Rename the single remaining column of doc to the date
    doc.columns = [column_date]
    return doc

In [73]:
doc1 = creat_dataframe('01-22-2020.csv')
doc2 = creat_dataframe('04-01-2020.csv')
doc2

Unnamed: 0_level_0,4/01/2020
Country_Region,Unnamed: 1_level_1
Afghanistan,237
Albania,259
Algeria,847
Andorra,390
Angola,8
...,...
Venezuela,143
Vietnam,218
West Bank and Gaza,134
Zambia,36


#### Merging the DataFrames

In [67]:
doc = pd.merge(doc1, doc2, how = 'outer', left_index = True, right_index = True)
doc.head()

Unnamed: 0_level_0,1/22/2020,4/01/2020
Country_Region,Unnamed: 1_level_1,Unnamed: 2_level_1
Afghanistan,,237
Albania,,259
Algeria,,847
Andorra,,390
Angola,,8


In [69]:
doc = doc.fillna(0)
doc.head()

Unnamed: 0_level_0,1/22/2020,4/01/2020
Country_Region,Unnamed: 1_level_1,Unnamed: 2_level_1
Afghanistan,0.0,237
Albania,0.0,259
Algeria,0.0,847
Andorra,0.0,390
Angola,0.0,8
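
The `0.0` values appear because a column that contained NaN after the outer merge is stored as float. A minimal sketch of filling the gaps and casting back to integers follows; the frames and their values are illustrative, with `A` and `B` as placeholder index labels.

```python
import pandas as pd

# Illustrative frames indexed by a placeholder country label.
left = pd.DataFrame({'1/22/2020': [2]}, index=['B'])
right = pd.DataFrame({'4/01/2020': [237, 5]}, index=['A', 'B'])

merged = pd.merge(left, right, how='outer', left_index=True, right_index=True)
print(merged.dtypes)   # the column with a missing value is float64

# Fill missing counts with 0, then cast every column back to integers.
cleaned = merged.fillna(0).astype('int64')
print(cleaned)
```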


#### Note: Listing the files in a folder
- split() can be used to pick out only the files with a given extension.
- `filename.split('.')` returns a list such as ['name', 'extension'], so `filename.split('.')[-1]` selects the last item, which is the extension.
  

In [1]:
import os

path = 'COVID-19-master/csse_covid_19_data/csse_covid_19_daily_reports/'
file_list = os.listdir(path)
csv_list = []

for file in file_list:
    # Keep only files with the csv extension
    if file.split('.')[-1] == 'csv': # the index must be -1 (the last item)
        csv_list.append(file)

print(csv_list)

['01-22-2020.csv', '01-23-2020.csv', '01-24-2020.csv', '01-25-2020.csv', '01-26-2020.csv', '01-27-2020.csv', '01-28-2020.csv', '01-29-2020.csv', '01-30-2020.csv', '01-31-2020.csv', '02-01-2020.csv', '02-02-2020.csv', '02-03-2020.csv', '02-04-2020.csv', '02-05-2020.csv', '02-06-2020.csv', '02-07-2020.csv', '02-08-2020.csv', '02-09-2020.csv', '02-10-2020.csv', '02-11-2020.csv', '02-12-2020.csv', '02-13-2020.csv', '02-14-2020.csv', '02-15-2020.csv', '02-16-2020.csv', '02-17-2020.csv', '02-18-2020.csv', '02-19-2020.csv', '02-20-2020.csv', '02-21-2020.csv', '02-22-2020.csv', '02-23-2020.csv', '02-24-2020.csv', '02-25-2020.csv', '02-26-2020.csv', '02-27-2020.csv', '02-28-2020.csv', '02-29-2020.csv', '03-01-2020.csv', '03-02-2020.csv', '03-03-2020.csv', '03-04-2020.csv', '03-05-2020.csv', '03-06-2020.csv', '03-07-2020.csv', '03-08-2020.csv', '03-09-2020.csv', '03-10-2020.csv', '03-11-2020.csv', '03-12-2020.csv', '03-13-2020.csv', '03-14-2020.csv', '03-15-2020.csv', '03-16-2020.csv', '03-17-20

In [4]:
# Sort in ascending order
# For descending order, use sort(reverse = True)
csv_list.sort()
print(csv_list)

['01-22-2020.csv', '01-23-2020.csv', '01-24-2020.csv', '01-25-2020.csv', '01-26-2020.csv', '01-27-2020.csv', '01-28-2020.csv', '01-29-2020.csv', '01-30-2020.csv', '01-31-2020.csv', '02-01-2020.csv', '02-02-2020.csv', '02-03-2020.csv', '02-04-2020.csv', '02-05-2020.csv', '02-06-2020.csv', '02-07-2020.csv', '02-08-2020.csv', '02-09-2020.csv', '02-10-2020.csv', '02-11-2020.csv', '02-12-2020.csv', '02-13-2020.csv', '02-14-2020.csv', '02-15-2020.csv', '02-16-2020.csv', '02-17-2020.csv', '02-18-2020.csv', '02-19-2020.csv', '02-20-2020.csv', '02-21-2020.csv', '02-22-2020.csv', '02-23-2020.csv', '02-24-2020.csv', '02-25-2020.csv', '02-26-2020.csv', '02-27-2020.csv', '02-28-2020.csv', '02-29-2020.csv', '03-01-2020.csv', '03-02-2020.csv', '03-03-2020.csv', '03-04-2020.csv', '03-05-2020.csv', '03-06-2020.csv', '03-07-2020.csv', '03-08-2020.csv', '03-09-2020.csv', '03-10-2020.csv', '03-11-2020.csv', '03-12-2020.csv', '03-13-2020.csv', '03-14-2020.csv', '03-15-2020.csv', '03-16-2020.csv', '03-17-20
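
Sorting the names as plain strings works for the files listed here because they all fall within one year. If the folder ever spans more than one year, a date-aware sort key is safer. The sketch below uses a hypothetical helper `report_date` and made-up file names.

```python
from datetime import datetime

def report_date(filename):
    """Parse the 'MM-DD-YYYY' part of a report file name into a datetime."""
    return datetime.strptime(filename.split('.')[0], '%m-%d-%Y')

# Made-up names that cross a year boundary.
names = ['01-01-2021.csv', '12-31-2020.csv', '01-22-2020.csv']

print(sorted(names))                   # string order puts 01-01-2021 first
print(sorted(names, key=report_date))  # chronological order
```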

### [ 10. Final Code ]

In [5]:
import os
import json
import pandas as pd

with open('COVID-19-master/csse_covid_19_data/country_convert.json','r', encoding = 'utf-8-sig') as json_file:
    json_data = json.load(json_file)

# Convert country names
def country_convert_name(row):
    if row['Country_Region'] in json_data:
        return json_data[row['Country_Region']]
    return row['Country_Region']
    
    
# Preprocess one file into a DataFrame
def create_dataframe(filename):
    
    doc = pd.read_csv(path + filename, encoding = 'utf-8-sig')
    
    try:
        doc = doc[['Country_Region', 'Confirmed']]
    except KeyError:
        doc = doc[['Country/Region', 'Confirmed']]
        doc.columns = ['Country_Region',  'Confirmed']
    
    doc = doc.dropna(subset = ['Confirmed'])
    doc = doc.astype({'Confirmed' : 'int64'})
    doc['Country_Region'] = doc.apply(country_convert_name, axis = 1)
    doc = doc.groupby('Country_Region').sum()
    
    date = filename.split('.')[0].lstrip('0').replace('-','/')
    doc.columns = [date]
    
    return doc 
    
# Read every csv file under path and merge them
def generate_dateframe_by_path(path):
    file_list = os.listdir(path)
    csv_list = []
    first_doc = True
     
    for file in file_list:
        if file.split('.')[-1] == 'csv':
            csv_list.append(file)
    
    csv_list.sort()
    
    for file in csv_list:
        doc = create_dataframe(file)
        # first_doc is True only for the first file, which becomes the base DataFrame
        if first_doc:
            final_doc = doc
            first_doc = False
        # Every later file is merged into the result as the loop runs
        else:
            final_doc = pd.merge(final_doc, doc, how = 'outer', left_index = True, right_index = True)
    
    final_doc = final_doc.fillna(0)   
    return final_doc 

In [6]:
path = 'COVID-19-master/csse_covid_19_data/csse_covid_19_daily_reports/'
doc = generate_dateframe_by_path(path)
doc

Unnamed: 0_level_0,1/22/2020,1/23/2020,1/24/2020,1/25/2020,1/26/2020,1/27/2020,1/28/2020,1/29/2020,1/30/2020,1/31/2020,...,6/08/2020,6/09/2020,6/10/2020,6/11/2020,6/12/2020,6/13/2020,6/14/2020,6/15/2020,6/16/2020,6/17/2020
Country_Region,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Afghanistan,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,20917.0,21459.0,22142.0,22890.0,23546.0,24102.0,24766.0,25527.0,26310.0,26874.0
Albania,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1263.0,1299.0,1341.0,1385.0,1416.0,1464.0,1521.0,1590.0,1672.0,1722.0
Algeria,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,10265.0,10382.0,10484.0,10589.0,10698.0,10810.0,10919.0,11031.0,11147.0,11268.0
Andorra,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,852.0,852.0,852.0,852.0,853.0,853.0,853.0,853.0,854.0,854.0
Angola,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,92.0,96.0,113.0,118.0,130.0,138.0,140.0,142.0,148.0,155.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Vietnam,0.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,...,332.0,332.0,332.0,332.0,333.0,334.0,334.0,334.0,334.0,335.0
West Bank and Gaza,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,473.0,481.0,485.0,487.0,489.0,489.0,492.0,505.0,514.0,555.0
Yemen,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,496.0,524.0,560.0,591.0,632.0,705.0,728.0,844.0,885.0,902.0
Zambia,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1200.0,1200.0,1200.0,1200.0,1321.0,1357.0,1358.0,1382.0,1405.0,1412.0


In [16]:
doc = doc.astype('int64')
doc

Unnamed: 0_level_0,1/22/2020,1/23/2020,1/24/2020,1/25/2020,1/26/2020,1/27/2020,1/28/2020,1/29/2020,1/30/2020,1/31/2020,2/01/2020,2/02/2020,2/03/2020,2/04/2020,2/05/2020,2/06/2020,2/07/2020,2/08/2020,2/09/2020,2/10/2020,2/11/2020,2/12/2020,2/13/2020,2/14/2020,2/15/2020,2/16/2020,2/17/2020,2/18/2020,2/19/2020,2/20/2020,2/21/2020,2/22/2020,2/23/2020,2/24/2020,2/25/2020,2/26/2020,2/27/2020,2/28/2020,2/29/2020,3/01/2020,...,5/09/2020,5/10/2020,5/11/2020,5/12/2020,5/13/2020,5/14/2020,5/15/2020,5/16/2020,5/17/2020,5/18/2020,5/19/2020,5/20/2020,5/21/2020,5/22/2020,5/23/2020,5/24/2020,5/25/2020,5/26/2020,5/27/2020,5/28/2020,5/29/2020,5/30/2020,5/31/2020,6/01/2020,6/02/2020,6/03/2020,6/04/2020,6/05/2020,6/06/2020,6/07/2020,6/08/2020,6/09/2020,6/10/2020,6/11/2020,6/12/2020,6/13/2020,6/14/2020,6/15/2020,6/16/2020,6/17/2020
Country_Region,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1,Unnamed: 49_level_1,Unnamed: 50_level_1,Unnamed: 51_level_1,Unnamed: 52_level_1,Unnamed: 53_level_1,Unnamed: 54_level_1,Unnamed: 55_level_1,Unnamed: 56_level_1,Unnamed: 57_level_1,Unnamed: 58_level_1,Unnamed: 59_level_1,Unnamed: 60_level_1,Unnamed: 61_level_1,Unnamed: 62_level_1,Unnamed: 63_level_1,Unnamed: 64_level_1,Unnamed: 65_level_1,Unnamed: 66_level_1,Unnamed: 67_level_1,Unnamed: 68_level_1,Unnamed: 69_level_1,Unnamed: 70_level_1,Unnamed: 71_level_1,Unnamed: 72_level_1,Unnamed: 73_level_1,Unnamed: 74_level_1,Unnamed: 75_level_1,Unnamed: 76_level_1,Unnamed: 77_level_1,Unnamed: 78_level_1,Unnamed: 79_level_1,Unnamed: 80_level_1,Unnamed: 81_level_1
Afghanistan,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,...,4033,4402,4687,4963,5226,5639,6053,6402,6664,7072,7653,8145,8676,9216,9998,10582,11173,11831,12456,13036,13659,14525,15205,15750,16509,17267,18054,18969,19551,20342,20917,21459,22142,22890,23546,24102,24766,25527,26310,26874
Albania,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,856,868,872,876,880,898,916,933,946,948,949,964,969,981,989,998,1004,1029,1050,1076,1099,1122,1137,1143,1164,1184,1197,1212,1232,1246,1263,1299,1341,1385,1416,1464,1521,1590,1672,1722
Algeria,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,...,5558,5723,5891,6067,6253,6442,6629,6821,7019,7201,7377,7542,7728,7918,8113,8306,8503,8697,8857,8997,9134,9267,9394,9513,9626,9733,9831,9935,10050,10154,10265,10382,10484,10589,10698,10810,10919,11031,11147,11268
Andorra,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,754,755,755,758,760,761,761,761,761,761,761,762,762,762,762,762,763,763,763,763,764,764,764,765,844,851,852,852,852,852,852,852,852,852,853,853,853,853,854,854
Angola,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,43,45,45,45,45,48,48,48,48,50,52,52,58,60,61,69,70,70,71,74,81,84,86,86,86,86,86,86,88,91,92,96,113,118,130,138,140,142,148,155
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Vietnam,0,2,2,2,2,2,2,2,2,2,6,6,8,8,8,10,10,13,13,14,15,15,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,...,288,288,288,288,288,312,314,318,320,324,324,324,324,324,325,325,326,327,327,327,328,328,328,328,328,328,328,328,329,331,332,332,332,332,333,334,334,334,334,335
West Bank and Gaza,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,375,375,375,375,375,375,375,376,381,388,391,398,423,423,423,423,423,429,434,446,446,447,448,449,451,457,464,464,464,472,473,481,485,487,489,489,492,505,514,555
Yemen,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,34,51,56,65,70,85,106,122,128,130,167,184,197,209,212,222,233,249,256,278,283,310,323,354,399,419,453,469,482,484,496,524,560,591,632,705,728,844,885,902
Zambia,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,252,267,267,441,446,654,654,679,753,761,772,832,866,920,920,920,920,920,1057,1057,1057,1057,1057,1089,1089,1089,1089,1089,1089,1089,1200,1200,1200,1200,1321,1357,1358,1382,1405,1412
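
As a variation on merging inside the loop, the per-file frames can be collected in a list and combined in one call to `pd.concat` with `axis=1`, which aligns them on the index with an outer join by default. The sketch below uses three tiny illustrative frames with placeholder labels `A`, `B`, `C` in place of real countries.

```python
import pandas as pd

# Illustrative per-day frames, each indexed by a placeholder country label.
frames = [
    pd.DataFrame({'1/22/2020': [1, 2]}, index=['A', 'B']),
    pd.DataFrame({'1/23/2020': [3]}, index=['B']),
    pd.DataFrame({'1/24/2020': [4, 5]}, index=['B', 'C']),
]

# axis=1 places the frames side by side; missing combinations become NaN.
combined = pd.concat(frames, axis=1).fillna(0).astype('int64')
print(combined)
```

Combining once at the end avoids building a new intermediate DataFrame for every file, which may matter as the number of daily reports grows.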


#### Writing a csv file with pandas
- Use to_csv() to save a pandas DataFrame as a csv file
    ```
    doc.to_csv("path/filename.csv")
    ```

- The encoding option can also be given
    ```
    doc.to_csv("path/filename.csv", encoding='utf-8-sig')
    ```

In [17]:
doc.to_csv("./covid-19.csv", encoding='utf-8-sig')
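
To check that the saved file can be read back, a round trip can be sketched as follows. The temporary directory, the file name `covid-19-demo.csv`, and the two-row table are assumptions for illustration only.

```python
import os
import tempfile

import pandas as pd

# Illustrative table shaped like the final result: countries as index, dates as columns.
table = pd.DataFrame(
    {'1/22/2020': [0, 2]},
    index=pd.Index(['Afghanistan', 'Vietnam'], name='Country_Region'),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'covid-19-demo.csv')
    # utf-8-sig writes a byte order mark at the start of the file.
    table.to_csv(csv_path, encoding='utf-8-sig')
    # Read with the same encoding and restore the country column as the index.
    restored = pd.read_csv(csv_path, index_col='Country_Region', encoding='utf-8-sig')

print(restored)
```

The byte order mark written by `utf-8-sig` is commonly used so that spreadsheet programs such as Excel recognize the file as UTF-8.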