# 11. Data Preprocessing
---
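## Handling Missing Values
- Missing values can cause problems in data analysis, so they should be handled before the analysis begins.
- pandas provides several methods for handling missing values.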

In [1]:
import pandas as pd
df = pd.read_csv('data/customers3.csv', index_col='고객ID')
df

고객ID,고객명,나이,거주도시,주요관심사,최근1년_방문빈도,평균구매액(만원),고객만족도(점),재구매의사(점)
C001,박태근,35,시흥,Electronics,75,70,78,92
C002,이영희,30,안양,fashion,85,92,88,85
C003,박지성,35,울산,,95,110,91,88
C004,최민아,40,창원,Fashion,92,105,82,96
C005,정수빈,28,구로,Fashion,82,88,94,79
C006,윤태영,27,시흥,Electronics,88,95,85,91
C007,한유진,31,파주,,68,65,77,83
C008,강민호,29,일산,Sports,98,125,96,90


**Removing missing values `dropna()`**
- Removes rows or columns that contain missing values (NaN, Not a Number).
- The original is not affected; a new DataFrame is returned.

In [2]:
df.dropna()

고객ID,고객명,나이,거주도시,주요관심사,최근1년_방문빈도,평균구매액(만원),고객만족도(점),재구매의사(점)
C001,박태근,35,시흥,Electronics,75,70,78,92
C002,이영희,30,안양,fashion,85,92,88,85
C004,최민아,40,창원,Fashion,92,105,82,96
C005,정수빈,28,구로,Fashion,82,88,94,79
C006,윤태영,27,시흥,Electronics,88,95,85,91
C008,강민호,29,일산,Sports,98,125,96,90
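
The non-mutating behavior noted above can be checked with a minimal sketch. The `toy` DataFrame and its column names are hypothetical and are not taken from `customers3.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from customers3.csv
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "interest": ["Sports", np.nan, "Fashion"],
})

# dropna() returns a new DataFrame without the row that contains NaN
cleaned = toy.dropna()

print(len(toy))      # the original still has all of its rows
print(len(cleaned))  # the returned DataFrame has one row fewer
```

To keep the cleaned result, assign it to a name (for example `toy = toy.dropna()`), since the call by itself does not change `toy`.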


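**how**
- Determines how many values in a row or column must be missing for it to be removed.
- Rows are checked by default (axis=0); a row that meets the condition is removed.
    - any : remove if at least one value is NaN (default)
    - all : remove only if every value is NaN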

In [3]:
df.dropna(how='all')

고객ID,고객명,나이,거주도시,주요관심사,최근1년_방문빈도,평균구매액(만원),고객만족도(점),재구매의사(점)
C001,박태근,35,시흥,Electronics,75,70,78,92
C002,이영희,30,안양,fashion,85,92,88,85
C003,박지성,35,울산,,95,110,91,88
C004,최민아,40,창원,Fashion,92,105,82,96
C005,정수빈,28,구로,Fashion,82,88,94,79
C006,윤태영,27,시흥,Electronics,88,95,85,91
C007,한유진,31,파주,,68,65,77,83
C008,강민호,29,일산,Sports,98,125,96,90


In [5]:
import numpy as np
df.loc['C009'] = np.nan
df

고객ID,고객명,나이,거주도시,주요관심사,최근1년_방문빈도,평균구매액(만원),고객만족도(점),재구매의사(점)
C001,박태근,35.0,시흥,Electronics,75.0,70.0,78.0,92.0
C002,이영희,30.0,안양,fashion,85.0,92.0,88.0,85.0
C003,박지성,35.0,울산,,95.0,110.0,91.0,88.0
C004,최민아,40.0,창원,Fashion,92.0,105.0,82.0,96.0
C005,정수빈,28.0,구로,Fashion,82.0,88.0,94.0,79.0
C006,윤태영,27.0,시흥,Electronics,88.0,95.0,85.0,91.0
C007,한유진,31.0,파주,,68.0,65.0,77.0,83.0
C008,강민호,29.0,일산,Sports,98.0,125.0,96.0,90.0
C009,,,,,,,,


In [9]:
df.dropna(how='all')


고객ID,고객명,나이,거주도시,주요관심사,최근1년_방문빈도,평균구매액(만원),고객만족도(점),재구매의사(점)
C001,박태근,35.0,시흥,Electronics,75.0,70.0,78.0,92.0
C002,이영희,30.0,안양,fashion,85.0,92.0,88.0,85.0
C003,박지성,35.0,울산,,95.0,110.0,91.0,88.0
C004,최민아,40.0,창원,Fashion,92.0,105.0,82.0,96.0
C005,정수빈,28.0,구로,Fashion,82.0,88.0,94.0,79.0
C006,윤태영,27.0,시흥,Electronics,88.0,95.0,85.0,91.0
C007,한유진,31.0,파주,,68.0,65.0,77.0,83.0
C008,강민호,29.0,일산,Sports,98.0,125.0,96.0,90.0
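
The `how` option applies to columns as well when `axis=1` is passed. The following is a small sketch of that case, assuming a hypothetical `toy` DataFrame in which one column is entirely NaN and another is only partly NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "memo" is entirely NaN, "score" is partly NaN
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "score": [90.0, np.nan, 75.0],
    "memo": [np.nan, np.nan, np.nan],
})

# axis=1 checks columns instead of rows
any_cols = toy.dropna(axis=1, how="any")  # drops every column containing a NaN
all_cols = toy.dropna(axis=1, how="all")  # drops only columns that are all NaN

print(list(any_cols.columns))
print(list(all_cols.columns))
```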


**subset**
- Specifies the columns to check for missing values. A row is removed only when one of the specified columns contains a missing value.

In [10]:
df.dropna(subset='주요관심사')

고객ID,고객명,나이,거주도시,주요관심사,최근1년_방문빈도,평균구매액(만원),고객만족도(점),재구매의사(점)
C001,박태근,35.0,시흥,Electronics,75.0,70.0,78.0,92.0
C002,이영희,30.0,안양,fashion,85.0,92.0,88.0,85.0
C004,최민아,40.0,창원,Fashion,92.0,105.0,82.0,96.0
C005,정수빈,28.0,구로,Fashion,82.0,88.0,94.0,79.0
C006,윤태영,27.0,시흥,Electronics,88.0,95.0,85.0,91.0
C008,강민호,29.0,일산,Sports,98.0,125.0,96.0,90.0
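
`subset` also accepts a list of column labels. As a rough sketch with hypothetical data (the `toy` frame and its columns are illustrative assumptions), only the listed columns are inspected and NaN values elsewhere are ignored.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from customers3.csv
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "city": ["Seoul", np.nan, "Busan", "Daegu"],
    "interest": ["Sports", "Fashion", np.nan, "Sports"],
    "score": [90.0, 80.0, 70.0, np.nan],
})

# Only "city" and "interest" are checked; the NaN in "score" does not matter
result = toy.dropna(subset=["city", "interest"])

print(list(result["name"]))
```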


**Replacing missing values `fillna()`**
- Fills missing values (NaN, Not a Number) with a specified value or by a specified method.
- The original is not affected; a new DataFrame is returned.

In [None]:
df.fillna('')    # replace NaN with an empty string

고객ID,고객명,나이,거주도시,주요관심사,최근1년_방문빈도,평균구매액(만원),고객만족도(점),재구매의사(점)
C001,박태근,35.0,시흥,Electronics,75.0,70.0,78.0,92.0
C002,이영희,30.0,안양,fashion,85.0,92.0,88.0,85.0
C003,박지성,35.0,울산,,95.0,110.0,91.0,88.0
C004,최민아,40.0,창원,Fashion,92.0,105.0,82.0,96.0
C005,정수빈,28.0,구로,Fashion,82.0,88.0,94.0,79.0
C006,윤태영,27.0,시흥,Electronics,88.0,95.0,85.0,91.0
C007,한유진,31.0,파주,,68.0,65.0,77.0,83.0
C008,강민호,29.0,일산,Sports,98.0,125.0,96.0,90.0
C009,,,,,,,,
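
Filling every column with the same value is not always appropriate. One possible approach, sketched here with hypothetical data, is to pass a dict to `fillna()` so that each column gets its own replacement, such as a placeholder label for text and the column mean for numbers.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from customers3.csv
toy = pd.DataFrame({
    "interest": ["Sports", np.nan, "Fashion"],
    "score": [90.0, np.nan, 70.0],
})

# Keys are column names, values are the replacement for NaN in that column
filled = toy.fillna({
    "interest": "Unknown",          # placeholder label (an arbitrary choice)
    "score": toy["score"].mean(),   # mean of the non-missing scores
})

print(filled)
```

Whether the mean is a sensible replacement depends on the data; it is used here only to illustrate a per-column fill.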


### Handling Duplicate Data
**Checking for duplicates `duplicated()`**

In [None]:
data = {
    'name': ['Alice', 'Bob', 'Alice', 'David'],
    'age': [20, 30, 20, 40]
}
df = pd.DataFrame(data)

In [21]:
print(df.duplicated())

**Removing duplicates `drop_duplicates()`**

In [None]:
df = df.drop_duplicates()
df
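
As a self-contained sketch of how `duplicated()` and `drop_duplicates()` behave, the example below rebuilds a small frame under the hypothetical name `people` and also tries the `keep` and `subset` parameters, which are not covered in the cells above.

```python
import pandas as pd

# Small frame in which rows 0 and 2 are identical
people = pd.DataFrame({
    "name": ["Alice", "Bob", "Alice", "David"],
    "age": [20, 30, 20, 40],
})

# Boolean Series: True marks a row that repeats an earlier row
mask = people.duplicated()

# By default the first occurrence is kept and later repeats are dropped
deduped = people.drop_duplicates()

# keep="last" keeps the final occurrence instead
last_kept = people.drop_duplicates(keep="last")

# subset limits the comparison to the listed columns
by_name = people.drop_duplicates(subset=["name"])

print(mask.tolist())
print(list(deduped.index))
print(list(last_kept.index))
```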

### Sorting Data

**Sorting by value `sort_values()`**
- Data can be sorted by the values of a specific column.
- The default order is ascending; use the `ascending=False` option for descending order.

In [23]:
import pandas as pd
df = pd.read_csv('data/customers3.csv', index_col='고객ID')
df

고객ID,고객명,나이,거주도시,주요관심사,최근1년_방문빈도,평균구매액(만원),고객만족도(점),재구매의사(점)
C001,박태근,35,시흥,Electronics,75,70,78,92
C002,이영희,30,안양,fashion,85,92,88,85
C003,박지성,35,울산,,95,110,91,88
C004,최민아,40,창원,Fashion,92,105,82,96
C005,정수빈,28,구로,Fashion,82,88,94,79
C006,윤태영,27,시흥,Electronics,88,95,85,91
C007,한유진,31,파주,,68,65,77,83
C008,강민호,29,일산,Sports,98,125,96,90


In [24]:
df.sort_values(by='나이')

고객ID,고객명,나이,거주도시,주요관심사,최근1년_방문빈도,평균구매액(만원),고객만족도(점),재구매의사(점)
C006,윤태영,27,시흥,Electronics,88,95,85,91
C005,정수빈,28,구로,Fashion,82,88,94,79
C008,강민호,29,일산,Sports,98,125,96,90
C002,이영희,30,안양,fashion,85,92,88,85
C007,한유진,31,파주,,68,65,77,83
C001,박태근,35,시흥,Electronics,75,70,78,92
C003,박지성,35,울산,,95,110,91,88
C004,최민아,40,창원,Fashion,92,105,82,96
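
The descending option mentioned earlier, along with sorting by more than one column, can be sketched as follows. The `toy` frame and its columns are hypothetical and are not the `customers3.csv` data.

```python
import pandas as pd

# Hypothetical toy data, not taken from customers3.csv
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "age": [35, 27, 35, 40],
    "score": [78, 85, 91, 82],
})

# Descending order by a single column
desc = toy.sort_values(by="age", ascending=False)

# Several keys: age ascending, then score descending within the same age
multi = toy.sort_values(by=["age", "score"], ascending=[True, False])

print(list(desc["age"]))
print(list(multi["name"]))
```

Like `dropna()` and `fillna()`, `sort_values()` returns a new DataFrame, so `toy` keeps its original row order.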


**Sorting by index `sort_index()`**