# 9. Tidying and Reshaping Data
---
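### Characteristics of untidy data
- The relationship between merge-by columns is unclear
- Data on the 'one' side of a one-to-many relationship is duplicated
- Data is duplicated because of many-to-many relationships
- Values are stored in column names
- Multiple values are stored in a single variable value
- The data is not structured at the unit of analysis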
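
The second symptom (duplication on the 'one' side) can be spotted by counting distinct values of each column within a key. The following is a minimal sketch on a made-up frame; the column names mirror the COVID-19 data used later in this chapter, but the values and the `toy` name are invented for illustration.

```python
import pandas as pd

# Toy data (invented values): population belongs to the 'one' side (location)
# but is repeated on every daily row.
toy = pd.DataFrame({
    'location': ['A', 'A', 'A', 'B', 'B'],
    'casedate': ['2020-01-01', '2020-01-02', '2020-01-03',
                 '2020-01-01', '2020-01-02'],
    'new_cases': [1, 2, 0, 5, 3],
    'population': [100, 100, 100, 250, 250],
})

# Number of distinct values per location for each non-key column.
distinct_per_location = toy.groupby('location')[['new_cases', 'population']].nunique()

# A column with exactly one distinct value in every group is constant within
# location, so it is a candidate for a separate one-row-per-location table.
is_constant = (distinct_per_location == 1).all()
constant_cols = is_constant[is_constant].index.tolist()
print(constant_cols)
```

Here only `population` is constant within each location, which suggests it belongs in a location-level table rather than in the daily rows.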
---
### Key Methods
- Remove duplicate rows
- Fix many-to-many relationships
- Reshape wide format to long format with stack & melt
- Melt groups of columns
- Reshape long format to wide format with unstack & pivot
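
To make the wide/long distinction concrete, here is a small round trip with `melt` and `pivot`. It is only a sketch: the frame, its column names (`cases_2019`, `cases_2020`), and its values are hypothetical and are not part of the chapter's data.

```python
import pandas as pd

# Toy wide-format data (invented values): one column per year,
# so the year is stored in the column names.
wide = pd.DataFrame({
    'country': ['A', 'B'],
    'cases_2019': [10, 20],
    'cases_2020': [15, 25],
})

# Wide -> long: the year columns become values of a 'year' variable.
long_df = wide.melt(id_vars='country', var_name='year', value_name='cases')
long_df['year'] = long_df['year'].str.replace('cases_', '', regex=False).astype(int)

# Long -> wide: pivot spreads the 'year' values back into columns.
wide_again = long_df.pivot(index='country', columns='year', values='cases')

print(long_df)
print(wide_again)
```

The long frame has one row per country and year, while the pivoted frame returns to one row per country with a column for each year.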

## (1) Removing duplicate rows
- The data is COVID-19 case data

In [1]:
import pandas as pd
covidcases = pd.read_csv('data/covidcases720.csv')

In [2]:
# Build lists of columns for daily cases and deaths, cumulative totals, and demographics
dailyvars = ['casedate', 'new_cases', 'new_deaths']
totvars = ['location', 'total_cases', 'total_deaths']
demovars = ['population', 'population_density', 'median_age', 'gdp_per_capita', 'hospital_beds_per_thousand', 'region']

In [3]:
covidcases[dailyvars + totvars + demovars].head(3).T

| | 0 | 1 | 2 |
|---|---|---|---|
| casedate | 2019-12-31 | 2020-01-01 | 2020-01-02 |
| new_cases | 0.0 | 0.0 | 0.0 |
| new_deaths | 0.0 | 0.0 | 0.0 |
| location | Afghanistan | Afghanistan | Afghanistan |
| total_cases | 0.0 | 0.0 | 0.0 |
| total_deaths | 0.0 | 0.0 | 0.0 |
| population | 38928341.0 | 38928341.0 | 38928341.0 |
| population_density | 54.422 | 54.422 | 54.422 |
| median_age | 18.6 | 18.6 | 18.6 |
| gdp_per_capita | 1803.987 | 1803.987 | 1803.987 |


### Create a DataFrame with only the daily data

In [4]:
coviddaily = covidcases[['location'] + dailyvars]

In [5]:
coviddaily.shape

(29529, 4)

In [6]:
coviddaily.head()

| | location | casedate | new_cases | new_deaths |
|---|---|---|---|---|
| 0 | Afghanistan | 2019-12-31 | 0.0 | 0.0 |
| 1 | Afghanistan | 2020-01-01 | 0.0 | 0.0 |
| 2 | Afghanistan | 2020-01-02 | 0.0 | 0.0 |
| 3 | Afghanistan | 2020-01-03 | 0.0 | 0.0 |
| 4 | Afghanistan | 2020-01-04 | 0.0 | 0.0 |
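
Before relying on a daily table like this, it can help to confirm that the key columns identify each row uniquely. The sketch below does this with `duplicated()` on a made-up frame; `toy_daily` and its values are invented for illustration, and the choice of `location` plus `casedate` as the key is an assumption about how the daily data is meant to be identified.

```python
import pandas as pd

# Toy daily data (invented values) with one repeated (location, casedate) pair.
toy_daily = pd.DataFrame({
    'location': ['A', 'A', 'A', 'B'],
    'casedate': ['2020-01-01', '2020-01-02', '2020-01-02', '2020-01-01'],
    'new_cases': [1, 2, 2, 5],
})

# Flag rows that repeat an earlier (location, casedate) pair.
dup_mask = toy_daily.duplicated(subset=['location', 'casedate'])
n_dups = int(dup_mask.sum())

# Drop the repeats, keeping the first occurrence of each key.
deduped = toy_daily.drop_duplicates(subset=['location', 'casedate'])

print(n_dups, len(deduped))
```

A count of zero duplicates would mean the key is already unique and no rows need to be dropped.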


### Select one row per country
- Count the unique locations to know how many locations (countries) to expect
- Sort by location and casedate
- Use drop_duplicates() to select one row per location, with keep='last' to retain the last row

In [7]:
covidcases.location.nunique()

209

In [11]:
coviddemo = (covidcases[['casedate'] + totvars + demovars]
             .sort_values(['location', 'casedate'])
             .drop_duplicates(['location'], keep='last')
             .rename(columns={'casedate': 'lastdate'}))

In [12]:
coviddemo.head()

| | lastdate | location | total_cases | total_deaths | population | population_density | median_age | gdp_per_capita | hospital_beds_per_thousand | region |
|---|---|---|---|---|---|---|---|---|---|---|
| 184 | 2020-07-12 | Afghanistan | 34451.0 | 1010.0 | 38928341.0 | 54.422 | 18.6 | 1803.987 | 0.5 | South Asia |
| 310 | 2020-07-12 | Albania | 3371.0 | 89.0 | 2877800.0 | 104.871 | 38.0 | 11803.431 | 2.89 | Eastern Europe |
| 500 | 2020-07-12 | Algeria | 18712.0 | 1004.0 | 43851043.0 | 17.348 | 29.1 | 13913.839 | 1.9 | North Africa |
| 621 | 2020-07-12 | Andorra | 855.0 | 52.0 | 77265.0 | 163.755 | NaN | NaN | NaN | Western Europe |
| 734 | 2020-07-12 | Angola | 483.0 | 25.0 | 32866268.0 | 23.89 | 16.8 | 5819.495 | NaN | Central Africa |


In [13]:
coviddemo.shape

(209, 10)
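
The row count above matches the number of unique locations, which is the check one would hope for after deduplicating. The same sort-then-`drop_duplicates` pattern and sanity check can be sketched on a small made-up frame (the `toy` name and all values are invented for illustration):

```python
import pandas as pd

# Toy cumulative data (invented values), deliberately out of order.
toy = pd.DataFrame({
    'location': ['B', 'A', 'A', 'B', 'A'],
    'casedate': ['2020-01-02', '2020-01-01', '2020-01-03',
                 '2020-01-01', '2020-01-02'],
    'total_cases': [8, 1, 3, 5, 3],
})

# Sort so the latest date is last within each location, then keep that row.
last_rows = (toy.sort_values(['location', 'casedate'])
             .drop_duplicates(['location'], keep='last'))

# Sanity checks: exactly one row per location.
assert len(last_rows) == toy['location'].nunique()
assert last_rows['location'].is_unique

print(last_rows)
```

Sorting first matters: `keep='last'` keeps whichever row happens to come last, so without the sort the retained row would depend on the input order.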

In [14]:
coviddemo.head(3).T

| | 184 | 310 | 500 |
|---|---|---|---|
| lastdate | 2020-07-12 | 2020-07-12 | 2020-07-12 |
| location | Afghanistan | Albania | Algeria |
| total_cases | 34451.0 | 3371.0 | 18712.0 |
| total_deaths | 1010.0 | 89.0 | 1004.0 |
| population | 38928341.0 | 2877800.0 | 43851043.0 |
| population_density | 54.422 | 104.871 | 17.348 |
| median_age | 18.6 | 38.0 | 29.1 |
| gdp_per_capita | 1803.987 | 11803.431 | 13913.839 |
| hospital_beds_per_thousand | 0.5 | 2.89 | 1.9 |
| region | South Asia | Eastern Europe | North Africa |


### Sum values by group
- Use the groupby method to sum total cases and deaths for each country

In [15]:
covidtotals = (covidcases.groupby(['location'], as_index=False)
               .agg({'new_cases': 'sum', 'new_deaths': 'sum', 'median_age': 'last',
                     'gdp_per_capita': 'last', 'region': 'last', 'casedate': 'last',
                     'population': 'last'})
               .rename(columns={'new_cases': 'total_cases', 'new_deaths': 'total_deaths',
                                'casedate': 'lastdate'}))

In [17]:
covidtotals.head(3).T

| | 0 | 1 | 2 |
|---|---|---|---|
| location | Afghanistan | Albania | Algeria |
| total_cases | 34451.0 | 3371.0 | 18712.0 |
| total_deaths | 1010.0 | 89.0 | 1004.0 |
| median_age | 18.6 | 38.0 | 29.1 |
| gdp_per_capita | 1803.987 | 11803.431 | 13913.839 |
| region | South Asia | Eastern Europe | North Africa |
| lastdate | 2020-07-12 | 2020-07-12 | 2020-07-12 |
| population | 38928341.0 | 2877800.0 | 43851043.0 |


> 👉 Whether to use drop_duplicates or groupby depends on whether you need to aggregate before collapsing the 'many' side!
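
The two routes can be compared side by side. In this sketch on a made-up frame (the `toy` name and values are invented for illustration), the running total is already stored on each row, so `drop_duplicates` needs no aggregation, while the daily counts have to be summed with `groupby`:

```python
import pandas as pd

# Toy daily data (invented values).
toy = pd.DataFrame({
    'location': ['A', 'A', 'A', 'B', 'B'],
    'casedate': ['2020-01-01', '2020-01-02', '2020-01-03',
                 '2020-01-01', '2020-01-02'],
    'new_cases': [1, 2, 0, 5, 3],
})
# Running total per location, like the total_cases column in the source data.
toy['total_cases'] = toy.groupby('location')['new_cases'].cumsum()

# No aggregation needed: the last row already carries the running total.
by_dedup = (toy.sort_values(['location', 'casedate'])
            .drop_duplicates(['location'], keep='last')
            .set_index('location')['total_cases'])

# Aggregation needed: the daily counts must be summed to get the total.
by_groupby = toy.groupby('location')['new_cases'].sum()

print(by_dedup.to_dict())
print(by_groupby.to_dict())
```

Both routes give the same totals here, just as `coviddemo` and `covidtotals` agree on `total_cases` and `total_deaths` in the outputs above.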