### Goal: a model that predicts length from weight
- Data: fish.csv
- Feature: weight
- Label/target: length
- Learning method: supervised learning + prediction ==> KNN-based regression
- Train/test data: split 7:3


#### (1) Loading modules and preparing data <hr>

In [385]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.linear_model import LinearRegression
import numpy as np

In [386]:
data_file = '../data/fish.csv'

In [387]:
fishDF = pd.read_csv(data_file, usecols=[0,1,2])
fishDF

Unnamed: 0,Species,Weight,Length
0,Bream,242.0,25.4
1,Bream,290.0,26.3
2,Bream,340.0,26.5
3,Bream,363.0,29.0
4,Bream,430.0,29.0
...,...,...,...
154,Smelt,12.2,12.2
155,Smelt,13.4,12.4
156,Smelt,12.2,13.0
157,Smelt,19.7,14.3


In [388]:
perchDF = fishDF[fishDF['Species']=='Perch']
perchDF = perchDF.reset_index(drop=True)

In [389]:
perchDF.info(), perchDF.head(5)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 56 entries, 0 to 55
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   Species  56 non-null     object 
 1   Weight   56 non-null     float64
 2   Length   56 non-null     float64
dtypes: float64(2), object(1)
memory usage: 1.4+ KB


(None,
   Species  Weight  Length
 0   Perch     5.9     8.4
 1   Perch    32.0    13.7
 2   Perch    40.0    15.0
 3   Perch    51.5    16.2
 4   Perch    70.0    17.4)

In [390]:
perchDF.describe()

Unnamed: 0,Weight,Length
count,56.0,56.0
mean,382.239286,27.892857
std,347.617717,9.021668
min,5.9,8.4
25%,120.0,21.825
50%,207.5,25.3
75%,692.5,36.625
max,1100.0,44.0


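#### (3) Data preprocessing <hr>
- Handle missing values, outliers, and duplicates
- Check the overall data distribution, per-column distributions, modes, and unique values
- Scatter plots, histograms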
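
The checks listed above could be sketched roughly as follows. This is a minimal illustration, not the notebook's own preprocessing: the ten rows are copied from the training split displayed later in this notebook, the name `perch` is a stand-in for `perchDF`, and the 1.5 × IQR rule for flagging outliers is an assumed convention.

```python
import pandas as pd

# Ten perch rows copied from the training split shown in this notebook
# (the notebook itself loads the full data from fish.csv).
perch = pd.DataFrame({
    "Weight": [900.0, 110.0, 80.0, 1100.0, 120.0, 135.0, 145.0, 1000.0, 265.0, 32.0],
    "Length": [40.0, 21.0, 19.0, 42.0, 22.0, 22.0, 24.0, 44.0, 27.5, 13.7],
})

# Missing values per column and fully duplicated rows
missing_per_column = perch.isna().sum()
duplicate_rows = int(perch.duplicated().sum())

# Outlier candidates by the 1.5 * IQR rule (an assumed convention)
q1 = perch["Weight"].quantile(0.25)
q3 = perch["Weight"].quantile(0.75)
iqr = q3 - q1
outlier_mask = (perch["Weight"] < q1 - 1.5 * iqr) | (perch["Weight"] > q3 + 1.5 * iqr)
outlier_count = int(outlier_mask.sum())

# Mode and number of unique values for the target column
length_mode = perch["Length"].mode().tolist()
length_unique = int(perch["Length"].nunique())

print(missing_per_column)
print("duplicate rows:", duplicate_rows)
print("outlier candidates:", outlier_count)
print("mode:", length_mode, "unique values:", length_unique)
```

On the full data the same calls would run on `perchDF` directly; whether flagged rows should be dropped or kept is a separate decision that this sketch does not make.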

####  (3-1) Data distribution

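#### (4) Training <hr>
- Learning method: supervised learning + regression (prediction) => KNN regression with KNeighborsRegressor (the LinearRegression lines below are left commented out)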
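
The cells that follow fit a `KNeighborsRegressor`. As a rough sketch of what KNN regression computes, the hypothetical helper `knn_predict` below averages the targets of the k training points closest to the query. It assumes a single feature and uniform weighting, and it is not scikit-learn's implementation. The sample rows are copied from the training split shown in this notebook.

```python
import numpy as np

def knn_predict(x_train, y_train, x_new, k=5):
    """Predict by averaging the targets of the k nearest training points."""
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    distances = np.abs(x_train - x_new)            # one feature: absolute difference
    nearest = np.argsort(distances, kind="stable")[:k]
    return float(y_train[nearest].mean())

# Ten (weight, length) pairs taken from the training split in this notebook
weights = [900.0, 110.0, 80.0, 1100.0, 120.0, 135.0, 145.0, 1000.0, 265.0, 32.0]
lengths = [40.0, 21.0, 19.0, 42.0, 22.0, 22.0, 24.0, 44.0, 27.5, 13.7]

pred_mid = knn_predict(weights, lengths, 130.0, k=3)    # inside the training range
pred_far = knn_predict(weights, lengths, 5000.0, k=3)   # far beyond the heaviest fish
pred_farther = knn_predict(weights, lengths, 10000.0, k=3)

print(pred_mid, pred_far, pred_farther)
```

Because the prediction is an average of stored targets, every weight beyond the heaviest training sample gets the same answer here, which is one reason to check `perchDF['Weight'].max()` before trusting predictions for large fish.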

In [391]:
# model = LinearRegression()

In [392]:
# model.fit(perchDF[['Weight']], perchDF['Length'])

In [393]:
featureDF = pd.DataFrame(perchDF[perchDF.columns[1]])
featureDF.head(2)
type(featureDF)

pandas.core.frame.DataFrame

In [394]:
targetDF = perchDF['Length']
targetDF.head(2)

0     8.4
1    13.7
Name: Length, dtype: float64

In [395]:
from sklearn.model_selection import train_test_split

In [396]:
X_train, X_test, y_train, y_test = train_test_split(featureDF,
                                                    targetDF,
                                                    test_size=0.3,
                                                    random_state=1234)
print(len(X_train), X_train)

39     Weight
48   900.0
10   110.0
7     80.0
51  1100.0
14   120.0
17   135.0
25   145.0
55  1000.0
32   265.0
1     32.0
37   514.0
54  1000.0
2     40.0
35   300.0
39   840.0
36   320.0
0      5.9
11   115.0
3     51.5
34   250.0
52  1000.0
9     85.0
16   130.0
5    100.0
28   197.0
46   820.0
45   650.0
43   690.0
30   300.0
26   188.0
41   700.0
23   170.0
49  1015.0
15   120.0
24   225.0
12   125.0
38   556.0
19   130.0
47   850.0


In [397]:
from sklearn.neighbors import KNeighborsRegressor

model = KNeighborsRegressor()

In [398]:
X_train

Unnamed: 0,Weight
48,900.0
10,110.0
7,80.0
51,1100.0
14,120.0
17,135.0
25,145.0
55,1000.0
32,265.0
1,32.0


In [399]:
y_train

48    40.0
10    21.0
7     19.0
51    42.0
14    22.0
17    22.0
25    24.0
55    44.0
32    27.5
1     13.7
37    32.8
54    43.5
2     15.0
35    28.7
39    35.0
36    30.0
0      8.4
11    21.0
3     16.2
34    28.0
52    43.0
9     20.0
16    22.0
5     18.0
28    25.6
46    39.0
45    39.0
43    37.0
30    27.3
26    24.6
41    36.0
23    23.5
49    40.0
15    22.0
24    24.0
12    21.0
38    34.5
19    22.5
47    40.0
Name: Length, dtype: float64

In [400]:
model.fit(pd.DataFrame(X_train), y_train)

In [401]:
model.score(pd.DataFrame(X_train), y_train), model.score(pd.DataFrame(X_test), y_test)

(0.9698063746915628, 0.9917356711359953)

In [403]:
type(X_test)

pandas.core.frame.DataFrame

In [405]:
y_pre = model.predict(X_test)
y_pre = np.round(y_pre,1)
y_pre

array([25.9, 39.6, 19.8, 22.8, 18.8, 38. , 42.5, 21.9, 20.8, 27.1, 37.1,
       27.1, 17.6, 22.8, 37.2, 23.9, 22.3])

In [406]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# R^2 between the true test lengths and the rounded predictions
r2ret = r2_score(y_test, y_pre)
r2ret

0.9916384264034183

In [407]:
mse = mean_squared_error(y_test, y_pre)
mse

0.5341176470588237

In [408]:
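# RMSE as the square root of the MSE; the squared=False argument is deprecated in recent scikit-learn releases
rmse = np.sqrt(mean_squared_error(y_test, y_pre))
rmse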

0.7308335289645814

In [409]:
mae = mean_absolute_error(y_test, y_pre)
mae

0.5647058823529415

In [410]:
perchDF['Weight'].max()

1100.0

In [459]:
# Search over train/test splits (random_state) and neighbor counts (k),
# printing only the combinations whose combined test error is small.
# Note: choosing random_state by test-set metrics makes the reported scores optimistic.
for i in range(10000):
    X_train, X_test, y_train, y_test = train_test_split(featureDF,
                                                        targetDF,
                                                        test_size=0.3,
                                                        random_state=i)
    for k in range(1,40):
        model = KNeighborsRegressor(n_neighbors=k)
        model.fit(X_train, y_train)
        y_pre = model.predict(X_test)
        y_pre = np.round(y_pre,1)
        r2ret = r2_score(y_test, y_pre)
        mse = mean_squared_error(y_test, y_pre)
        rmse = np.sqrt(mse)
        mae = mean_absolute_error(y_test, y_pre)
        # Heuristic filter: the sum of the three error metrics is below 1.5
        if mse+rmse+mae <1.5:
            print(f'k={k}')
            print(f'random_state : {i}')
            print(f'Train score : {model.score(X_train, y_train):.1f}, Test score : {model.score(X_test, y_test):.1f}')
            print(f'R2:{r2ret:.2f}, MSE:{mse:.2f}cm^2, RMSE:{rmse:.2f}cm, MAE:{mae:.2f}cm')

k=7
random_state : 176
Train score : 1.0, Test score : 1.0
R2:0.99, MSE:0.34cm^2, RMSE:0.58cm, MAE:0.42cm
k=7
random_state : 462
Train score : 1.0, Test score : 1.0
R2:0.99, MSE:0.40cm^2, RMSE:0.63cm, MAE:0.45cm
k=5
random_state : 5046
Train score : 1.0, Test score : 1.0
R2:0.99, MSE:0.37cm^2, RMSE:0.61cm, MAE:0.51cm
k=10
random_state : 6001
Train score : 0.9, Test score : 1.0
R2:0.99, MSE:0.37cm^2, RMSE:0.61cm, MAE:0.46cm
k=9
random_state : 8422
Train score : 0.9, Test score : 1.0
R2:0.99, MSE:0.35cm^2, RMSE:0.59cm, MAE:0.51cm
k=7
random_state : 8718
Train score : 1.0, Test score : 1.0
R2:0.99, MSE:0.35cm^2, RMSE:0.59cm, MAE:0.48cm
k=2
random_state : 9880
Train score : 1.0, Test score : 1.0
R2:0.99, MSE:0.38cm^2, RMSE:0.62cm, MAE:0.44cm


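#### (6) Testing <hr>
- Check whether the model was built properly
    - Score on the training data
    - Score on the test data
    - Compare the training score with the test score
        - Training score well above test score: overfitting
        - Training score ≒ test score: good fit
        - Training score▼, test score▼ (both low): underfitting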
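
The comparison above could be turned into a small helper along these lines. The function name `diagnose_fit` and the thresholds `gap` and `low` are hypothetical, chosen only for illustration; sensible cutoffs depend on the data and the metric.

```python
def diagnose_fit(train_score, test_score, gap=0.05, low=0.7):
    """Return a rough label from two R^2 scores.

    `gap` and `low` are illustrative thresholds, not standard values.
    """
    if train_score < low and test_score < low:
        return "underfitting"      # both scores are low
    if train_score - test_score > gap:
        return "overfitting"       # training score well above test score
    return "good fit"              # scores are close to each other

# Scores reported earlier in this notebook for the default KNeighborsRegressor
label_notebook = diagnose_fit(0.9698063746915628, 0.9917356711359953)
label_over = diagnose_fit(0.99, 0.80)
label_under = diagnose_fit(0.50, 0.45)

print(label_notebook, label_over, label_under)
```

With these assumed thresholds the notebook's scores fall in the "good fit" branch, since the two values differ by only about 0.02.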

#### (7) Performance evaluation

In [410]:
from sklearn.metrics import r2_score
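
For reference, the four metrics used in this notebook can be written out with NumPy alone. The sketch below uses the standard definitions with a hypothetical helper `regression_metrics` and made-up toy values. It is meant to show what the scikit-learn functions compute, not to replace them.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute R^2, MSE, RMSE and MAE from their standard definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mse = float(np.mean(residuals ** 2))             # mean squared error (squared units)
    rmse = float(np.sqrt(mse))                       # back in the target's units
    mae = float(np.mean(np.abs(residuals)))          # mean absolute error
    ss_res = float(np.sum(residuals ** 2))           # residual sum of squares
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"r2": r2, "mse": mse, "rmse": rmse, "mae": mae}

# Toy values chosen for easy hand-checking (not taken from fish.csv)
metrics = regression_metrics([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
print(metrics)
```

Note that MSE is in squared units (cm² for lengths in cm), while RMSE and MAE are in the same units as the target.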