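### Missing values (NA)
+ A missing value is a value that was never filled in during data collection
+ For example, when a survey respondent skips a particular question
    - the answer to that question becomes a missing value
+ If the data contains missing values
    - the analysis can produce biased or distorted results
+ Remedies: remove the missing values or replace them with an estimate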

In [1]:
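x <- c(1,2,3,NA,5,NA,7)
sum(x)   # returns NA because x contains NA

# Check for NA: TRUE marks a missing element
is.na(x)

# Count the missing values
sum(is.na(x))

# Frequency table of missing (TRUE) vs. observed (FALSE) elements
table(is.na(x))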


FALSE  TRUE 
    5     2 
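
Besides `is.na()`, base R offers a few other helpers for locating missing values. The following is a minimal sketch on a vector of the same shape; the name `v` is only illustrative and not part of the original notebook.

```r
# Illustrative vector with two missing values (hypothetical name `v`)
v <- c(1, 2, 3, NA, 5, NA, 7)

anyNA(v)          # TRUE: at least one element is missing
which(is.na(v))   # positions of the missing elements: 4 6
mean(is.na(v))    # proportion of missing values: 2/7
```

`which()` is handy when the positions matter, while `mean(is.na(v))` gives the missing rate directly because `TRUE` counts as 1.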

### Handling missing values: removal

In [2]:
sum(x, na.rm=TRUE) # sum after excluding NA

In [3]:
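# Remove NA
na.omit(x)

# Remove NA and store the result as a new plain vector
# (as.vector() drops the na.action attribute that na.omit() attaches)
x2 <- as.vector(na.omit(x))
x2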
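
The same idea carries over to data frames, where removal works row by row. A small sketch, assuming a made-up data frame `df` that is not in the original notebook:

```r
# Toy data frame (hypothetical) with NAs in different columns
df <- data.frame(id    = 1:4,
                 score = c(90, NA, 75, 60),
                 grade = c("A", "B", NA, "D"),
                 stringsAsFactors = FALSE)

complete.cases(df)                    # TRUE for rows without any NA
df_clean <- df[complete.cases(df), ]  # keep only the complete rows
nrow(df_clean)                        # 2 rows remain (ids 1 and 4)
```

`na.omit(df)` selects the same rows; `complete.cases()` is useful when the logical mask itself is needed.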

### Handling missing values: imputation

In [4]:
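x_mean <- mean(x, na.rm=TRUE)   # named x_mean so the variable does not reuse the name of mean()

# Use logical indexing to find the NA elements and replace them with the mean
x[is.na(x)] <- x_mean
x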
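
Which statistic to impute with depends on the data. The sketch below, on a hypothetical vector `y` containing an outlier, compares mean and median imputation; it is an illustration, not part of the original notebook.

```r
# Hypothetical vector with two NAs and one outlier (100)
y <- c(1, 2, 3, NA, 5, NA, 100)

# Mean imputation: the outlier pulls the fill value up
y_mean <- y
y_mean[is.na(y_mean)] <- mean(y, na.rm = TRUE)        # fills with 22.2

# Median imputation: less sensitive to the outlier
y_median <- y
y_median[is.na(y_median)] <- median(y, na.rm = TRUE)  # fills with 3
```

With skewed data the median usually gives a more representative fill value than the mean.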

### Reading the postal code data and handling missing values

In [5]:
# read.csv(file, header, sep, stringsAsFactors, na.strings)
zipcode <- read.csv('rd1/zipcode_2013.txt', header=TRUE, sep='\t')

head(zipcode)

|   | ZIPCODE | SIDO | GUGUN | DONG | RI | BUNJI | SEQ |
|---|---|---|---|---|---|---|---|
|   | `<fct>` | `<fct>` | `<fct>` | `<fct>` | `<fct>` | `<fct>` | `<int>` |
| 1 | 135-806 | 서울 | 강남구 | 개포1동 | 경남아파트 |  | 1 |
| 2 | 135-807 | 서울 | 강남구 | 개포1동 | 우성3차아파트 | (1∼6동) | 2 |
| 3 | 135-806 | 서울 | 강남구 | 개포1동 | 우성9차아파트 | (901∼902동) | 3 |
| 4 | 135-770 | 서울 | 강남구 | 개포1동 | 주공아파트 | (1∼16동) | 4 |
| 5 | 135-805 | 서울 | 강남구 | 개포1동 | 주공아파트 | (17∼40동) | 5 |
| 6 | 135-966 | 서울 | 강남구 | 개포1동 | 주공아파트 | (41∼85동) | 6 |


In [6]:
# Handle missing values in the titanic data
# na.strings: treat both empty strings and the literal "NA" as missing
titanic <- read.csv('csv/titanic2.csv', na.strings=c('','NA'))
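
To see what `na.strings` changes, the following sketch reads a small inline CSV twice. The data and the names `csv_text`, `raw`, and `cleaned` are made up for illustration.

```r
# Small inline CSV (made-up data) with an empty field and a literal "NA"
csv_text <- "name,city,age
Kim,Seoul,30
Lee,,NA
Park,Busan,41"

# Default settings: the empty character field is read as "" rather than NA
raw <- read.csv(text = csv_text, stringsAsFactors = FALSE)
is.na(raw$city)

# With na.strings, empty strings are converted to NA as well
cleaned <- read.csv(text = csv_text, stringsAsFactors = FALSE,
                    na.strings = c("", "NA"))
colSums(is.na(cleaned))   # city and age each have one NA
```

Without `na.strings = c("", "NA")`, blank text fields would go unnoticed by `is.na()`.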

In [7]:
str(titanic)

'data.frame':	1306 obs. of  13 variables:
 $ pclass  : int  1 1 1 1 1 1 1 1 1 1 ...
 $ survived: int  1 1 0 0 0 1 1 0 1 0 ...
 $ name    : Factor w/ 1304 levels "Abbing, Mr. Anthony",..: 22 24 25 26 27 31 46 47 51 55 ...
 $ sex     : Factor w/ 2 levels "female","male": 1 2 1 2 1 2 1 2 1 2 ...
 $ age     : num  29 0.917 2 30 25 ...
 $ sibsp   : int  0 1 1 1 1 0 1 0 2 0 ...
 $ parch   : int  0 2 2 2 2 0 0 0 0 0 ...
 $ ticket  : Factor w/ 927 levels "110152","110413",..: 187 49 49 49 49 124 92 16 76 824 ...
 $ fare    : num  211 152 152 152 152 ...
 $ embarked: Factor w/ 3 levels "C","Q","S": 3 3 3 3 3 3 3 3 3 1 ...
 $ life    : Factor w/ 2 levels "dead","live": 2 2 1 1 1 2 2 1 2 1 ...
 $ seat    : Factor w/ 3 levels "1st","2nd","3rd": 1 1 1 1 1 1 1 1 1 1 ...
 $ port    : Factor w/ 3 levels "cherbourg","qeenstown",..: 3 3 3 3 3 3 3 3 3 1 ...


In [8]:
summary(titanic)

     pclass         survived                                    name     
 Min.   :1.000   Min.   :0.0000   Connolly, Miss. Kate            :   2  
 1st Qu.:2.000   1st Qu.:0.0000   Kelly, Mr. James                :   2  
 Median :3.000   Median :0.0000   Abbing, Mr. Anthony             :   1  
 Mean   :2.296   Mean   :0.3813   Abbott, Master. Eugene Joseph   :   1  
 3rd Qu.:3.000   3rd Qu.:1.0000   Abbott, Mr. Rossmore Edward     :   1  
 Max.   :3.000   Max.   :1.0000   Abbott, Mrs. Stanton (Rosa Hunt):   1  
                                  (Other)                         :1298  
     sex           age              sibsp         parch             ticket    
 female:464   Min.   : 0.1667   Min.   :0.0   Min.   :0.0000   CA. 2343:  11  
 male  :842   1st Qu.:22.0000   1st Qu.:0.0   1st Qu.:0.0000   1601    :   8  
              Median :29.8811   Median :0.0   Median :0.0000   CA 2144 :   8  
              Mean   :29.8269   Mean   :0.5   Mean   :0.3859   3101295 :   7  
             

In [9]:
head(titanic)

|   | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | embarked | life | seat | port |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|   | `<int>` | `<int>` | `<fct>` | `<fct>` | `<dbl>` | `<int>` | `<int>` | `<fct>` | `<dbl>` | `<fct>` | `<fct>` | `<fct>` | `<fct>` |
| 1 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0 | 0 | 0 | 24160 | 211.3375 | S | live | 1st | southampthon |
| 2 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.55 | S | live | 1st | southampthon |
| 3 | 1 | 0 | Allison, Miss. Helen Loraine | female | 2.0 | 1 | 2 | 113781 | 151.55 | S | dead | 1st | southampthon |
| 4 | 1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0 | 1 | 2 | 113781 | 151.55 | S | dead | 1st | southampthon |
| 5 | 1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0 | 1 | 2 | 113781 | 151.55 | S | dead | 1st | southampthon |
| 6 | 1 | 1 | Anderson, Mr. Harry | male | 48.0 | 0 | 0 | 19952 | 26.55 | S | live | 1st | southampthon |


In [10]:
colSums(is.na(titanic))
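
`is.na()` on a data frame returns a logical matrix, so `colSums()` counts the `TRUE` values per column. A minimal sketch on a hypothetical data frame `toy` (not the titanic data):

```r
# Toy data frame (hypothetical) with NAs in two of its three columns
toy <- data.frame(a = c(1, NA, 3, 4),
                  b = c(NA, NA, "x", "y"),
                  c = 1:4,
                  stringsAsFactors = FALSE)

na_count <- colSums(is.na(toy))    # a = 1, b = 2, c = 0
na_share <- colMeans(is.na(toy))   # a = 0.25, b = 0.50, c = 0.00

# Columns ordered by number of missing values, worst first
sort(na_count, decreasing = TRUE)
```

The per-column share from `colMeans()` helps decide whether a column should be imputed or dropped.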

In [11]:
# Missing counts: age 264, cabin 1015
# age   : impute with a substitute value
# cabin : drop the whole column

# Drop the cabin column
titanic$cabin <- NULL
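
The decision to drop a mostly empty column can also be expressed as a rule. The sketch below drops every column whose missing share exceeds a cutoff; the data frame `toy` and the 0.5 `threshold` are assumptions chosen for illustration, not values from the original analysis.

```r
# Toy data frame (hypothetical): cabin is mostly missing
toy <- data.frame(id    = 1:5,
                  age   = c(22, NA, 35, 41, 29),
                  cabin = c(NA, NA, "C85", NA, NA),
                  stringsAsFactors = FALSE)

threshold <- 0.5                            # assumed cutoff for illustration
keep <- colMeans(is.na(toy)) <= threshold   # TRUE for columns worth keeping
toy_reduced <- toy[, keep, drop = FALSE]
names(toy_reduced)                          # "id" "age"
```

`drop = FALSE` keeps the result a data frame even if only one column survives.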

In [12]:
# Replace the missing values in the age column with the median
md <- median(titanic$age, na.rm=TRUE)
titanic$age[is.na(titanic$age)] <- md
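
A possible refinement, which the original notebook does not use, is to impute the median within groups instead of over the whole column. A sketch with base R's `ave()` on a hypothetical data frame `toy`:

```r
# Toy data frame (hypothetical): age varies by passenger class
toy <- data.frame(pclass = c(1, 1, 1, 3, 3, 3),
                  age    = c(40, NA, 50, 20, 24, NA))

# Helper (illustrative) that fills NAs with the median of its input
fill_median <- function(a) {
  a[is.na(a)] <- median(a, na.rm = TRUE)
  a
}

# ave() applies the helper separately within each pclass group
toy$age <- ave(toy$age, toy$pclass, FUN = fill_median)
toy$age   # 40 45 50 20 24 22
```

Whether group-wise imputation is appropriate depends on how strongly the grouping variable relates to the imputed column.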

In [13]:
# embarked has only 3 missing values, so drop those rows
# (na.omit() removes every row that still has an NA in any column)
titanic <- na.omit(titanic)
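
`na.omit()` and filtering on a single column are not equivalent when other columns still contain NAs. A minimal sketch of the difference on a hypothetical data frame `toy`:

```r
# Toy data frame (hypothetical) with NAs in two different columns
toy <- data.frame(id       = 1:4,
                  fare     = c(7.25, NA, 8.05, 53.1),
                  embarked = c("S", "C", NA, "Q"),
                  stringsAsFactors = FALSE)

any_na_dropped <- na.omit(toy)                  # drops ids 2 and 3
by_embarked    <- toy[!is.na(toy$embarked), ]   # drops only id 3

nrow(any_na_dropped)   # 2
nrow(by_embarked)      # 3
```

Filtering on one column is the safer choice when rows should be removed only for that column's missing values.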

In [14]:
# Final check: every column should now report 0 missing values
colSums(is.na(titanic))

In [15]:
# Save the final result to a file
save(titanic, file='rdata/titanic.rdata')
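
A file written with `save()` is restored with `load()`, which recreates the object under its original name. The round trip below is a sketch using a temporary file and a made-up data frame `toy`, not the author's `rdata/` path.

```r
# Toy data frame (hypothetical) standing in for the cleaned data
toy <- data.frame(id = 1:3, age = c(29, 30, 25))

path <- tempfile(fileext = ".rdata")   # temporary file for illustration
save(toy, file = path)

rm(toy)                 # remove the object to show that load() restores it
loaded <- load(path)    # load() returns the names of the restored objects
loaded                  # "toy"
nrow(toy)               # 3
```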