
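### Data preprocessing and missing values
* Data preprocessing is a required step in every data analysis project
* A commonly cited rule of thumb is that about 80% of project time goes to data collection and preprocessing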

#### Considerations when handling missing values
- Dropping every missing value can cause substantial data loss
- Imputing missing values poorly can introduce bias into the data
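
Both risks can be seen on a very small example. The sketch below uses a made-up frame named `demo` (it is not part of this notebook's data) and only illustrates the idea.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: each row is missing at most one value
demo = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [10.0, 20.0, np.nan, 40.0],
})

# Risk 1: dropping every incomplete row discards half of the rows
rows_kept = len(demo.dropna())

# Risk 2: a poorly chosen fill value shifts summary statistics
mean_ignoring_nan = demo["a"].mean()               # (1 + 3 + 4) / 3
mean_after_zero_fill = demo["a"].fillna(0).mean()  # (1 + 0 + 3 + 4) / 4

print(rows_kept, mean_ignoring_nan, mean_after_zero_fill)
```

Filling with 0 pulls the mean of `a` down from about 2.67 to 2.0, which is the kind of bias the second point warns about.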


#### Missing values: omitted values, empty values, NULL values

* NA: Not Available (does not exist, missing)
*  Null: empty(null) object
*  NaN: Not a Number (python)
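
In pandas these markers can be checked with a single function. The following is a small sketch, assuming a pandas version that provides `pd.NA` (1.0 or later); it also shows that an empty string is not considered missing.

```python
import numpy as np
import pandas as pd

# pd.isna() treats None, np.nan and pd.NA as missing
flags = [pd.isna(v) for v in (None, np.nan, pd.NA)]

# NaN never compares equal to itself, so == cannot be used to find it
nan_equals_itself = (np.nan == np.nan)

# An empty string is an ordinary value, not a missing value
empty_is_missing = pd.isna("")

print(flags, nan_equals_itself, empty_is_missing)
```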


[Reference](https://towardsdatascience.com/4-techniques-to-deal-with-missing-data-in-datasets-841f8a303395)


* Missing values can distort analysis results
* Real-world data often contains missing values caused by errors during collection, so these values need to be cleaned before analysis



In [1]:
import pandas as pd
import numpy as np

In [136]:
df = pd.DataFrame({'x': [1, 2, np.nan, 3]})
df

Unnamed: 0,x
0,1.0
1,2.0
2,
3,3.0


In [129]:
# Use isnull() or isna() to check for missing values; missing entries are shown as True
df['x'].isnull()
df['x'].isna()

0    False
1    False
2     True
3    False
Name: x, dtype: bool

In [130]:
# Use notna() to check which values are not missing
df['x'].notna()

0     True
1     True
2    False
3     True
Name: x, dtype: bool

* View only the rows without missing values

In [131]:
df[df['x'].notna()]

Unnamed: 0,x
0,1.0
1,2.0
3,3.0


* dropna() removes every row that contains a missing value
> dropna() returns a new DataFrame; the original still contains the missing values after the call

In [132]:
df.dropna()

Unnamed: 0,x
0,1.0
1,2.0
3,3.0


In [133]:
df

Unnamed: 0,x
0,1.0
1,2.0
2,
3,3.0


In [137]:
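# To drop the missing values from the DataFrame itself, pass inplace=True
# (left commented out here, so df is unchanged below)
#df.dropna(inplace=True)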

In [138]:
df

Unnamed: 0,x
0,1.0
1,2.0
2,
3,3.0


In [139]:
df['x'].sum()

6.0

In [140]:
df['x'].mean()

2.0
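
The results 6.0 and 2.0 follow from the default `skipna=True`: the NaN is ignored, so the mean is taken over three values rather than four. A short sketch of how this default can be inspected:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, np.nan, 3])

total_default = s.sum()             # NaN is skipped
mean_default = s.mean()             # divided by the 3 non-missing values
total_strict = s.sum(skipna=False)  # NaN propagates into the result
count_non_missing = s.count()       # number of non-missing values

print(total_default, mean_default, total_strict, count_non_missing)
```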

* Filling missing values

In [144]:
df = pd.DataFrame({'x': [1, 2, np.nan, 3]})

In [150]:
df['x'].fillna(0)

0    1.0
1    2.0
2    0.0
3    3.0
Name: x, dtype: float64

In [149]:
df['x'].fillna(99)

0     1.0
1     2.0
2    99.0
3     3.0
Name: x, dtype: float64

In [148]:
# Fill with the previous value (forward fill)
df['x'].ffill()

0    1.0
1    2.0
2    2.0
3    3.0
Name: x, dtype: float64

In [147]:
# Fill with the next value (backward fill)
df['x'].bfill()

0    1.0
1    2.0
2    3.0
3    3.0
Name: x, dtype: float64

In [152]:
# Linear interpolation: a single gap gets the average of its two neighbors
df['x'].interpolate()


0    1.0
1    2.0
2    2.5
3    3.0
Name: x, dtype: float64

* Create a DataFrame with 10 rows x 6 columns

In [2]:
df = pd.DataFrame(
    {
        "Date" : pd.date_range(start="2021-11-11", periods=10,freq="D"),
        "Item" : 1014,
        "Measure_1": np.random.randint(1,10, size=10),
        "Measure_2": np.random.random(10).round(2),
        "Measure_3": np.random.random(10).round(2),
        "Measure_4": np.random.randn(10)
         
    }
)
df

Unnamed: 0,Date,Item,Measure_1,Measure_2,Measure_3,Measure_4
0,2021-11-11,1014,8,0.28,0.94,1.236532
1,2021-11-12,1014,5,0.96,0.23,0.430046
2,2021-11-13,1014,7,0.72,0.85,-0.193806
3,2021-11-14,1014,8,0.4,0.11,1.264676
4,2021-11-15,1014,5,0.78,0.49,-0.073902
5,2021-11-16,1014,5,0.03,0.09,1.308957
6,2021-11-17,1014,9,0.09,0.02,0.883758
7,2021-11-18,1014,3,0.02,0.32,0.961203
8,2021-11-19,1014,3,0.31,0.75,0.020584
9,2021-11-20,1014,1,0.96,0.87,-0.0906


* Insert missing values

In [3]:
df.loc[[2, 9], "Item"] = np.nan
df.loc[[2, 7, 9], "Measure_1"] = np.nan
df.loc[[2, 3], "Measure_2"] = np.nan
df.loc[[2], "Measure_3"] = np.nan
df.loc[:6, "Measure_4"] = np.nan  # .loc slices include the end label, so rows 0 through 6

df

Unnamed: 0,Date,Item,Measure_1,Measure_2,Measure_3,Measure_4
0,2021-11-11,1014.0,8.0,0.28,0.94,
1,2021-11-12,1014.0,5.0,0.96,0.23,
2,2021-11-13,,,,,
3,2021-11-14,1014.0,8.0,,0.11,
4,2021-11-15,1014.0,5.0,0.78,0.49,
5,2021-11-16,1014.0,5.0,0.03,0.09,
6,2021-11-17,1014.0,9.0,0.09,0.02,
7,2021-11-18,1014.0,,0.02,0.32,0.961203
8,2021-11-19,1014.0,3.0,0.31,0.75,0.020584
9,2021-11-20,,,0.96,0.87,-0.0906


Inserting missing values turned Item and Measure_1 into float64, so convert them back to integers with the nullable Int64 type
> $<NA>$ denotes a missing value

In [4]:
df = df.astype(
    {
        "Item": pd.Int64Dtype(),
        "Measure_1": pd.Int64Dtype(),
    }
)

df

Unnamed: 0,Date,Item,Measure_1,Measure_2,Measure_3,Measure_4
0,2021-11-11,1014.0,8.0,0.28,0.94,
1,2021-11-12,1014.0,5.0,0.96,0.23,
2,2021-11-13,,,,,
3,2021-11-14,1014.0,8.0,,0.11,
4,2021-11-15,1014.0,5.0,0.78,0.49,
5,2021-11-16,1014.0,5.0,0.03,0.09,
6,2021-11-17,1014.0,9.0,0.09,0.02,
7,2021-11-18,1014.0,,0.02,0.32,0.961203
8,2021-11-19,1014.0,3.0,0.31,0.75,0.020584
9,2021-11-20,,,0.96,0.87,-0.0906
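
The dtype change is easier to see on a single column. The sketch below uses a hypothetical three-element Series to show that NaN forces float64, while the nullable `Int64` type keeps integers and marks the gap as `<NA>`.

```python
import numpy as np
import pandas as pd

s = pd.Series([1014, np.nan, 1014])
float_dtype = str(s.dtype)      # NaN forces a float column

s_int = s.astype("Int64")       # nullable integer; NaN becomes <NA>
int_dtype = str(s_int.dtype)

print(s_int)
print(s_int.isna().sum())
```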


### 1. Removing missing values
* Drop every row that contains at least one missing value

In [5]:
df.dropna()

Unnamed: 0,Date,Item,Measure_1,Measure_2,Measure_3,Measure_4
8,2021-11-19,1014,3,0.31,0.75,0.020584


* Drop every column that contains a missing value

In [6]:
df.dropna(axis=1)

Unnamed: 0,Date
0,2021-11-11
1,2021-11-12
2,2021-11-13
3,2021-11-14
4,2021-11-15
5,2021-11-16
6,2021-11-17
7,2021-11-18
8,2021-11-19
9,2021-11-20


* Drop only the rows in which every value is missing
> No row is removed here: even row 2 still has a Date value

In [7]:
df.dropna(how="all")

Unnamed: 0,Date,Item,Measure_1,Measure_2,Measure_3,Measure_4
0,2021-11-11,1014.0,8.0,0.28,0.94,
1,2021-11-12,1014.0,5.0,0.96,0.23,
2,2021-11-13,,,,,
3,2021-11-14,1014.0,8.0,,0.11,
4,2021-11-15,1014.0,5.0,0.78,0.49,
5,2021-11-16,1014.0,5.0,0.03,0.09,
6,2021-11-17,1014.0,9.0,0.09,0.02,
7,2021-11-18,1014.0,,0.02,0.32,0.961203
8,2021-11-19,1014.0,3.0,0.31,0.75,0.020584
9,2021-11-20,,,0.96,0.87,-0.0906


* Drop rows that have fewer non-missing values than the threshold (thresh)
> Row 2 has only one non-missing value (Date), so it is dropped

In [8]:
df.dropna(thresh=4)

Unnamed: 0,Date,Item,Measure_1,Measure_2,Measure_3,Measure_4
0,2021-11-11,1014.0,8.0,0.28,0.94,
1,2021-11-12,1014.0,5.0,0.96,0.23,
3,2021-11-14,1014.0,8.0,,0.11,
4,2021-11-15,1014.0,5.0,0.78,0.49,
5,2021-11-16,1014.0,5.0,0.03,0.09,
6,2021-11-17,1014.0,9.0,0.09,0.02,
7,2021-11-18,1014.0,,0.02,0.32,0.961203
8,2021-11-19,1014.0,3.0,0.31,0.75,0.020584
9,2021-11-20,,,0.96,0.87,-0.0906


* Drop rows with missing values in specific columns

In [9]:
df.dropna(subset=["Measure_2", "Measure_3"])

Unnamed: 0,Date,Item,Measure_1,Measure_2,Measure_3,Measure_4
0,2021-11-11,1014.0,8.0,0.28,0.94,
1,2021-11-12,1014.0,5.0,0.96,0.23,
4,2021-11-15,1014.0,5.0,0.78,0.49,
5,2021-11-16,1014.0,5.0,0.03,0.09,
6,2021-11-17,1014.0,9.0,0.09,0.02,
7,2021-11-18,1014.0,,0.02,0.32,0.961203
8,2021-11-19,1014.0,3.0,0.31,0.75,0.020584
9,2021-11-20,,,0.96,0.87,-0.0906


## Imputing missing values
* Fill with constant values

In [10]:
values = {"Item": 1014, "Measure_1": 0}
df.fillna(value=values)

Unnamed: 0,Date,Item,Measure_1,Measure_2,Measure_3,Measure_4
0,2021-11-11,1014,8,0.28,0.94,
1,2021-11-12,1014,5,0.96,0.23,
2,2021-11-13,1014,0,,,
3,2021-11-14,1014,8,,0.11,
4,2021-11-15,1014,5,0.78,0.49,
5,2021-11-16,1014,5,0.03,0.09,
6,2021-11-17,1014,9,0.09,0.02,
7,2021-11-18,1014,0,0.02,0.32,0.961203
8,2021-11-19,1014,3,0.31,0.75,0.020584
9,2021-11-20,1014,0,0.96,0.87,-0.0906


* Fill with a representative value of a column (the mean is shown for reference; the fill below uses the median)

In [11]:
df["Measure_2"].mean()

0.42874999999999996

In [12]:
df["Measure_2"].fillna(df["Measure_2"].median())

0    0.280
1    0.960
2    0.295
3    0.295
4    0.780
5    0.030
6    0.090
7    0.020
8    0.310
9    0.960
Name: Measure_2, dtype: float64
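
Whether to use the mean or the median matters most when a column contains extreme values. The following sketch uses a hypothetical Series with one outlier (999) to compare the two fills.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one extreme value (999) and one missing value
s = pd.Series([18.0, 21.0, np.nan, 16.0, 999.0])

filled_mean = s.fillna(s.mean())      # mean = (18 + 21 + 16 + 999) / 4
filled_median = s.fillna(s.median())  # median of 16, 18, 21, 999

print(filled_mean[2], filled_median[2])
```

The mean-based fill (263.5) is dominated by the outlier, while the median-based fill (19.5) stays close to the typical values.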

* Fill with the previous or the next valid value
> bfill() copies the next valid value backward, so trailing gaps (row 9) remain missing

In [13]:
df.bfill()

Unnamed: 0,Date,Item,Measure_1,Measure_2,Measure_3,Measure_4
0,2021-11-11,1014.0,8.0,0.28,0.94,0.961203
1,2021-11-12,1014.0,5.0,0.96,0.23,0.961203
2,2021-11-13,1014.0,8.0,0.78,0.11,0.961203
3,2021-11-14,1014.0,8.0,0.78,0.11,0.961203
4,2021-11-15,1014.0,5.0,0.78,0.49,0.961203
5,2021-11-16,1014.0,5.0,0.03,0.09,0.961203
6,2021-11-17,1014.0,9.0,0.09,0.02,0.961203
7,2021-11-18,1014.0,3.0,0.02,0.32,0.961203
8,2021-11-19,1014.0,3.0,0.31,0.75,0.020584
9,2021-11-20,,,0.96,0.87,-0.0906
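
Row 9 keeps its gaps because a backward fill has nothing after the last row to copy from. One possible way to handle such edges, sketched on a hypothetical Series, is to chain a forward fill afterwards; `limit` caps how many consecutive gaps are filled.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0, np.nan])

back = s.bfill()            # the last value stays NaN: nothing follows it
both = s.bfill().ffill()    # a forward fill afterwards closes the tail
limited = s.ffill(limit=1)  # fill at most one consecutive gap per run

print(back.tolist())
print(both.tolist())
print(limited.tolist())
```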


## Missing values: practice

In [14]:
data = {
    'Fruit' : ['Apple',None,'Banana','',np.nan,'Strawberry','Banana','Banana','Apple'],
    'Age' : [np.nan, 14, 13, 22, 14, np.nan, 31, np.nan,None],
    'Height' : [187,181,155,165,177,171,170,179,164]
}

In [15]:
data

{'Age': [nan, 14, 13, 22, 14, nan, 31, nan, None],
 'Fruit': ['Apple',
  None,
  'Banana',
  '',
  nan,
  'Strawberry',
  'Banana',
  'Banana',
  'Apple'],
 'Height': [187, 181, 155, 165, 177, 171, 170, 179, 164]}

In [80]:
df = pd.DataFrame(data)

In [81]:
df

Unnamed: 0,Fruit,Age,Height
0,Apple,,187
1,,14.0,181
2,Banana,13.0,155
3,,22.0,165
4,,14.0,177
5,Strawberry,,171
6,Banana,31.0,170
7,Banana,,179
8,Apple,,164


### Checking for missing values

In [82]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Fruit   7 non-null      object 
 1   Age     5 non-null      float64
 2   Height  9 non-null      int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 344.0+ bytes


In [83]:
df.isnull().sum()

Fruit     2
Age       4
Height    0
dtype: int64

In [84]:
df.isna().any()

Fruit      True
Age        True
Height    False
dtype: bool

#### Find the rows with missing values in the Fruit column

In [85]:
df.columns


Index(['Fruit', 'Age', 'Height'], dtype='object')

In [86]:
df['Fruit'].isnull()

0    False
1     True
2    False
3    False
4     True
5    False
6    False
7    False
8    False
Name: Fruit, dtype: bool

In [87]:
df[df['Fruit'].isnull()]

Unnamed: 0,Fruit,Age,Height
1,,14.0,181
4,,14.0,177


#### Check that the Fruit column contains an empty string
> isnull() does not treat the empty string in row 3 as missing, so it needs a separate check

In [75]:
def is_emptystring(x):
    return x.eq('').any()

df.apply(is_emptystring)

Fruit      True
Age       False
Height    False
dtype: bool

### Imputing missing values

In [88]:
df.fillna({'Fruit': df['Fruit'].mode()[0], 'Age': int(df['Age'].mean())}, inplace=True)

> Missing Fruit values are filled with the mode, and missing Age values with the mean truncated to an integer (18)

In [89]:
df

Unnamed: 0,Fruit,Age,Height
0,Apple,18.0,187
1,Banana,14.0,181
2,Banana,13.0,155
3,,22.0,165
4,Banana,14.0,177
5,Strawberry,18.0,171
6,Banana,31.0,170
7,Banana,18.0,179
8,Apple,18.0,164


In [90]:
df = df.replace(' ', '').replace('', df['Fruit'].mode()[0])

In [91]:
df

Unnamed: 0,Fruit,Age,Height
0,Apple,18.0,187
1,Banana,14.0,181
2,Banana,13.0,155
3,Banana,22.0,165
4,Banana,14.0,177
5,Strawberry,18.0,171
6,Banana,31.0,170
7,Banana,18.0,179
8,Apple,18.0,164


> Empty strings are replaced with the mode
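
The cleanup above takes two steps because fillna() ignores empty strings. An alternative, sketched here on a hypothetical frame named `fruit`, is to convert empty or whitespace-only strings to NaN first so that isna() and fillna() cover every kind of gap.

```python
import numpy as np
import pandas as pd

# Hypothetical frame that mirrors the Fruit column above
fruit = pd.DataFrame({"Fruit": ["Apple", None, "Banana", "", "  ", "Banana"]})

# Turn empty or whitespace-only strings into NaN
cleaned = fruit.replace(r"^\s*$", np.nan, regex=True)

missing_before = fruit["Fruit"].isna().sum()   # only None is counted
missing_after = cleaned["Fruit"].isna().sum()  # None, "" and "  " are counted

# One fillna() call now covers every gap
cleaned["Fruit"] = cleaned["Fruit"].fillna(cleaned["Fruit"].mode()[0])
print(cleaned)
```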

### Removing missing values

In [92]:
data = {
    'Fruit' : ['Apple',None,'Banana','',np.nan,'Strawberry','Banana','Banana','Apple'],
    'Age' : [np.nan, 14, 13, 22, 14, np.nan, 31, np.nan,None],
    'Height' : [187,181,155,165,177,171,170,179,164]
}
df = pd.DataFrame(data)

In [93]:
df

Unnamed: 0,Fruit,Age,Height
0,Apple,,187
1,,14.0,181
2,Banana,13.0,155
3,,22.0,165
4,,14.0,177
5,Strawberry,,171
6,Banana,31.0,170
7,Banana,,179
8,Apple,,164


In [94]:
# Drop every row that has a missing value in any column
df = df.dropna()

In [95]:
df.isnull().sum()

Fruit     0
Age       0
Height    0
dtype: int64

In [96]:
df

Unnamed: 0,Fruit,Age,Height
2,Banana,13.0,155
3,,22.0,165
6,Banana,31.0,170


In [97]:
# Drop rows with a missing Fruit value (the empty string is not treated as missing and must be removed separately)
df.dropna(subset=['Fruit'])

Unnamed: 0,Fruit,Age,Height
2,Banana,13.0,155
3,,22.0,165
6,Banana,31.0,170


In [98]:
df.query('Fruit != ""')

Unnamed: 0,Fruit,Age,Height
2,Banana,13.0,155
6,Banana,31.0,170


In [99]:
# Drop columns that contain missing values
df.dropna(axis=1)

Unnamed: 0,Fruit,Age,Height
2,Banana,13.0,155
3,,22.0,165
6,Banana,31.0,170


In [100]:
# Drop the columns that contain an empty string (here, Fruit)
def is_emptystring(x):
    return x.eq('').any()

res = df.apply(is_emptystring)

# Collect the names of the columns that contain no empty string
valid_column = [name for name, has_empty in res.items() if not has_empty]
df[valid_column]

Unnamed: 0,Age,Height
2,13.0,155
3,22.0,165
6,31.0,170


### Fill missing values with 'missing'

In [125]:
df_missing = df.fillna('missing')
df_missing

Unnamed: 0,Fruit,Age,Height
2,Banana,13.0,155
3,,22.0,165
6,Banana,31.0,170


In [None]:
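# Fill missing values with each numeric column's mean (two equivalent forms)
# numeric_only=True skips the string column Fruit, which has no mean
df.fillna(df.mean(numeric_only=True)), df.where(pd.notnull(df), df.mean(numeric_only=True), axis='columns')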

(    Fruit   Age  Height
 2  Banana  13.0     155
 3          22.0     165
 6  Banana  31.0     170,     Fruit   Age  Height
 2  Banana  13.0     155
 3          22.0     165
 6  Banana  31.0     170)

In [None]:
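# file_path is defined here so that this cell runs in order (same URL as in the practice section below)
file_path = 'https://raw.githubusercontent.com/ark1st/Doit_R_ARKS_CODE/master/sample_NoNA.csv'
mpg_sample = pd.read_csv(file_path)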

In [None]:
# dropna() removes the rows or columns that contain missing data
mpg_sample.dropna()

Unnamed: 0,class,cty,hwy
0,compact,18.0,29.0
1,compact,21.0,29.0
3,compact,21.0,30.0
4,compact,16.0,26.0
5,compact,18.0,26.0
...,...,...,...
228,midsize,18.0,29.0
229,midsize,19.0,28.0
230,midsize,999.0,29.0
231,midsize,16.0,26.0


In [None]:
mpg_sample.dropna(how='any')

Unnamed: 0,class,cty,hwy
0,compact,18.0,29.0
1,compact,21.0,29.0
3,compact,21.0,30.0
4,compact,16.0,26.0
5,compact,18.0,26.0
...,...,...,...
228,midsize,18.0,29.0
229,midsize,19.0,28.0
230,midsize,999.0,29.0
231,midsize,16.0,26.0


### Missing values practice

* Load the data and check for missing values

In [101]:
file_path = 'https://raw.githubusercontent.com/ark1st/Doit_R_ARKS_CODE/master/sample_NoNA.csv'

In [102]:
mpg_sample = pd.read_csv(file_path)

In [103]:
mpg_sample.head()

Unnamed: 0,class,cty,hwy
0,compact,18.0,29.0
1,compact,21.0,29.0
2,compact,,31.0
3,compact,21.0,30.0
4,compact,16.0,26.0


In [104]:
mpg_sample.shape

(234, 3)

In [105]:
# Check for missing values
mpg_sample.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 234 entries, 0 to 233
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   class   219 non-null    object 
 1   cty     226 non-null    float64
 2   hwy     229 non-null    float64
dtypes: float64(2), object(1)
memory usage: 5.6+ KB


In [106]:
mpg_sample.isna().any()

class    True
cty      True
hwy      True
dtype: bool

In [107]:
mpg_sample.isna().sum()

class    15
cty       8
hwy       5
dtype: int64

In [108]:
# Locate the missing values
mpg_sample.isnull()

Unnamed: 0,class,cty,hwy
0,False,False,False
1,False,False,False
2,False,True,False
3,False,False,False
4,False,False,False
...,...,...,...
229,False,False,False
230,False,False,False
231,False,False,False
232,False,False,False


In [109]:
mpg_sample.notnull().sum()

class    219
cty      226
hwy      229
dtype: int64

### groupby()  

In [112]:
class_group = mpg_sample.groupby('class')
class_group.first()

Unnamed: 0_level_0,cty,hwy
class,Unnamed: 1_level_1,Unnamed: 2_level_1
2seater,16.0,26.0
compact,18.0,29.0
midsize,15.0,24.0
minivan,18.0,24.0
pickup,15.0,19.0
subcompact,18.0,26.0
suv,11.0,15.0


In [113]:
class_group = mpg_sample.groupby('class').mean()
class_group

Unnamed: 0_level_0,cty,hwy
class,Unnamed: 1_level_1,Unnamed: 2_level_1
2seater,15.4,24.8
compact,20.0,502.97619
midsize,70.315789,27.351351
minivan,15.818182,22.363636
pickup,13.193548,49.666667
subcompact,49.939394,28.142857
suv,12.32,19.666667
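
Group statistics like these can also serve as fill values, so that each gap is filled from its own class rather than from the whole column. The sketch below is one possible approach on a hypothetical frame named `cars` shaped like mpg_sample; it is not a step taken in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical sample shaped like mpg_sample (class, cty)
cars = pd.DataFrame({
    "class": ["compact", "compact", "compact", "suv", "suv", "suv"],
    "cty": [18.0, np.nan, 22.0, 11.0, 13.0, np.nan],
})

# transform("mean") returns the group mean aligned with each original row
group_mean = cars.groupby("class")["cty"].transform("mean")
cars["cty_filled"] = cars["cty"].fillna(group_mean)

print(cars)
```

Outliers such as the 999.0 values visible in mpg_sample would distort the group means, so they would need to be handled before a fill like this.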


### Fill missing values with 0

In [116]:
df_0 = mpg_sample.fillna(0)

In [117]:
df_0

Unnamed: 0,class,cty,hwy
0,compact,18.0,29.0
1,compact,21.0,29.0
2,compact,0.0,31.0
3,compact,21.0,30.0
4,compact,16.0,26.0
...,...,...,...
229,midsize,19.0,28.0
230,midsize,999.0,29.0
231,midsize,16.0,26.0
232,midsize,18.0,26.0


## Handling missing values in collected ASOS public data

In [55]:
import pandas as pd
import requests
import json


In [56]:
serviceKey = 'y80jnESQZu1%2B%2BKrpWpkGrnZ96%2FhiBicuIH%2F3SeO0u10CK9rglO3nqmwetj8%2BRHj%2F1NWUUis4aeGnUMk1CFUYRQ%3D%3D'
numOfRows = 31
startDt = 20210801
endDt = 20210831
stnId = 108  # station (region) ID
URL=f"http://apis.data.go.kr/1360000/AsosDalyInfoService/getWthrDataList?serviceKey={serviceKey}&pageNo=1&numOfRows={numOfRows}&dataType=JSON&dataCd=ASOS&dateCd=DAY&startDt={startDt}&endDt={endDt}&stnIds={stnId}"


In [57]:
def set_url(key, numOfRows, startDt, endDt, stnId):
    # Build the request URL from the arguments (uses the key parameter, not the global serviceKey)
    base = "http://apis.data.go.kr/1360000/AsosDalyInfoService/getWthrDataList"
    url = f"{base}?serviceKey={key}&pageNo=1&numOfRows={numOfRows}&dataType=JSON&dataCd=ASOS&dateCd=DAY&startDt={startDt}&endDt={endDt}&stnIds={stnId}"
    return url


In [58]:
URL = set_url(serviceKey, numOfRows, startDt, endDt, stnId)


* Build a DataFrame from the JSON data at the given URL

In [59]:
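def gen_df(URL):
    # Request the URL and parse the JSON body
    result = requests.get(URL)
    js = json.loads(result.content)
    # The daily records are nested under response > body > items > item
    data = pd.DataFrame(js['response']['body']['items']['item'])
    return data

df = gen_df(URL)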
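
gen_df() assumes that the response always contains the path response > body > items > item. A minimal sketch of a more defensive parser follows; `items_to_frame` and `sample` are hypothetical names, and the payload is hand-written to imitate that nesting rather than fetched from the live API.

```python
import pandas as pd

def items_to_frame(payload):
    """Build a DataFrame from a parsed response, or an empty one if no items exist."""
    try:
        items = payload["response"]["body"]["items"]["item"]
    except (KeyError, TypeError):
        # The expected path is absent (for example, an error response)
        return pd.DataFrame()
    return pd.DataFrame(items)

# Hypothetical payload that imitates the nesting used by gen_df()
sample = {"response": {"body": {"items": {"item": [
    {"tm": "2021-08-01", "avgTa": "27.1", "n99Rn": "22.7"},
    {"tm": "2021-08-04", "avgTa": "28.9", "n99Rn": ""},
]}}}}

frame = items_to_frame(sample)
broken = items_to_frame({"response": {"header": {}}})

print(frame)
print(frame.isna().sum())
```

Every value arrives as a string, so the empty `n99Rn` entry is not counted as missing at this stage.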

* Check the column numbers and data types (Dtype) before inspecting the values

In [60]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31 entries, 0 to 30
Data columns (total 62 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   stnId           31 non-null     object
 1   stnNm           31 non-null     object
 2   tm              31 non-null     object
 3   avgTa           31 non-null     object
 4   minTa           31 non-null     object
 5   minTaHrmt       31 non-null     object
 6   maxTa           31 non-null     object
 7   maxTaHrmt       31 non-null     object
 8   mi10MaxRn       31 non-null     object
 9   mi10MaxRnHrmt   31 non-null     object
 10  hr1MaxRn        31 non-null     object
 11  hr1MaxRnHrmt    31 non-null     object
 12  sumRnDur        31 non-null     object
 13  sumRn           31 non-null     object
 14  maxInsWs        31 non-null     object
 15  maxInsWsWd      31 non-null     object
 16  maxInsWsHrmt    31 non-null     object
 17  maxWs           31 non-null     object
 18  maxWsWd     

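* Check for missing values
> Every count is 0 because all columns hold strings, and an empty string is not counted as missing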

In [61]:
df.isnull().sum()

stnId        0
stnNm        0
tm           0
avgTa        0
minTa        0
            ..
sumLrgEv     0
sumSmlEv     0
n99Rn        0
iscs         0
sumFogDur    0
Length: 62, dtype: int64

In [62]:
# Inspect the value of n99Rn in row 3
df.iloc[3, 59:60]

n99Rn    
Name: 3, dtype: object

* Convert the n99Rn column to numeric, then check for missing values

In [63]:
df['n99Rn'] = pd.to_numeric(df['n99Rn'])
df

Unnamed: 0,stnId,stnNm,tm,avgTa,minTa,minTaHrmt,maxTa,maxTaHrmt,mi10MaxRn,mi10MaxRnHrmt,...,avgM05Te,avgM10Te,avgM15Te,avgM30Te,avgM50Te,sumLrgEv,sumSmlEv,n99Rn,iscs,sumFogDur
0,108,서울,2021-08-01,27.1,25.1,2345,28.8,1237,7.3,1917.0,...,28.7,26.0,24.4,17.8,15.2,1.3,1.9,22.7,{비}0525-0540. {비}0610-0635. {비}0750-0820. {비}0...,
1,108,서울,2021-08-02,26.5,25.0,17,28.6,1725,0.4,752.0,...,28.1,26.0,24.5,17.9,15.2,2.0,2.9,0.0,-{비}-0020. {비}0350-0415. {비}0505-{비}{강도0}0600-...,
2,108,서울,2021-08-03,28.0,24.1,538,31.4,1441,0.0,,...,27.7,26.0,24.6,17.9,15.3,2.8,4.0,0.0,{소나기}1604-1625.,
3,108,서울,2021-08-04,28.9,26.3,532,33.2,1426,,,...,27.7,25.9,24.6,18.0,15.3,4.4,6.4,,,
4,108,서울,2021-08-05,29.4,25.6,407,33.7,1511,,,...,28.0,25.9,24.7,18.1,15.3,4.5,6.4,,,
5,108,서울,2021-08-06,28.1,26.0,2332,32.2,1304,,,...,28.1,25.9,24.7,18.2,15.4,4.0,5.7,,,
6,108,서울,2021-08-07,28.0,23.4,547,32.3,1616,,,...,28.0,26.0,24.7,18.2,15.4,4.8,6.8,,,
7,108,서울,2021-08-08,26.8,24.4,607,32.1,1324,7.8,1529.0,...,28.0,26.0,24.8,18.3,15.5,2.9,4.2,12.7,{소나기}1455-{소나기}{강도0}1500-{비}1725-{비}{강도0}1800-...,
8,108,서울,2021-08-09,28.3,23.3,455,33.6,1530,,,...,27.5,26.0,24.8,18.4,15.5,5.8,8.4,,,
9,108,서울,2021-08-10,27.7,24.3,549,32.2,1528,,,...,27.7,25.9,24.8,18.5,15.5,4.1,5.9,0.0,,


In [64]:
df.isna().any()

stnId        False
stnNm        False
tm           False
avgTa        False
minTa        False
             ...  
sumLrgEv     False
sumSmlEv     False
n99Rn         True
iscs         False
sumFogDur    False
Length: 62, dtype: bool

In [65]:
df['n99Rn'].head() 

0    22.7
1     0.0
2     0.0
3     NaN
4     NaN
Name: n99Rn, dtype: float64

In [66]:
df.iloc[3, 59:60]

n99Rn    NaN
Name: 3, dtype: object

In [67]:
df.isnull().sum()

stnId         0
stnNm         0
tm            0
avgTa         0
minTa         0
             ..
sumLrgEv      0
sumSmlEv      0
n99Rn        11
iscs          0
sumFogDur     0
Length: 62, dtype: int64

* Exercise: inspect the values in the DataFrame, and where a column holds empty values, convert it to numeric and check for missing values.

In [68]:
df['sumFogDur'].head()

0    
1    
2    
3    
4    
Name: sumFogDur, dtype: object

In [69]:
df['sumFogDur'] = pd.to_numeric(df['sumFogDur'])
df['sumFogDur']

0      NaN
1      NaN
2      NaN
3      NaN
4      NaN
5      NaN
6      NaN
7      NaN
8      NaN
9      NaN
10     NaN
11     NaN
12     NaN
13     NaN
14     NaN
15     NaN
16     NaN
17     NaN
18     NaN
19     NaN
20    0.27
21     NaN
22     NaN
23     NaN
24     NaN
25     NaN
26     NaN
27     NaN
28     NaN
29     NaN
30     NaN
Name: sumFogDur, dtype: float64
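
Converting columns one at a time becomes tedious with 62 columns. The sketch below converts several columns at once on a hypothetical slice named `raw`; the list `numeric_cols` is an assumption, and `errors="coerce"` turns any value that cannot be parsed into NaN.

```python
import pandas as pd

# Hypothetical slice of the ASOS frame: every value arrives as a string
raw = pd.DataFrame({
    "tm": ["2021-08-01", "2021-08-04"],
    "avgTa": ["27.1", "28.9"],
    "n99Rn": ["22.7", ""],
})

numeric_cols = ["avgTa", "n99Rn"]  # assumed list of measurement columns
converted = raw.copy()
# errors="coerce" turns anything that cannot be parsed into NaN
converted[numeric_cols] = converted[numeric_cols].apply(pd.to_numeric, errors="coerce")

print(converted.dtypes)
print(converted.isna().sum())
```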

In [70]:
df = df.dropna()
df

Unnamed: 0,stnId,stnNm,tm,avgTa,minTa,minTaHrmt,maxTa,maxTaHrmt,mi10MaxRn,mi10MaxRnHrmt,...,avgM05Te,avgM10Te,avgM15Te,avgM30Te,avgM50Te,sumLrgEv,sumSmlEv,n99Rn,iscs,sumFogDur
20,108,서울,2021-08-21,23.8,21.0,1312,25.5,1600,11.9,1252,...,26.8,25.5,24.9,19.1,16.0,1.3,1.9,62.0,{비}0620-{비}{강도0}0900-{비}{강도1}1200-1350. {박무}09...,0.27


#### References

* https://wikidocs.net/16582