### Fish classification
- Dataset: fish.csv
- Features: Weight, Length
- Target/label: Species
- Learning type: supervised learning => classification
- Algorithm: k-nearest neighbors

[1] Data preparation

In [4]:
import pandas as pd
import matplotlib.pyplot as plt

In [5]:
DATA_FILE = '../data/fish.csv'

In [6]:
# Rows: Bream, Smelt; columns: Species, Weight, Length => column indices 0, 1, 2
fishDF=pd.read_csv(DATA_FILE,usecols=[0,1,2])
fishDF.head(3)

  Species  Weight  Length
0   Bream   242.0    25.4
1   Bream   290.0    26.3
2   Bream   340.0    26.5


In [7]:
mask=(fishDF['Species'] == 'Bream')| (fishDF['Species'] == 'Smelt')
twoDF=fishDF[mask]
twoDF.reset_index(drop=True,inplace=True)

In [8]:
twoDF.index

RangeIndex(start=0, stop=49, step=1)

In [9]:
# Encode the Species column numerically => Bream 0, Smelt 1
# Work on an explicit copy so the assignment does not raise SettingWithCopyWarning
twoDF=twoDF.copy()
twoDF['Fcode']=twoDF['Species'].map({'Bream':0,'Smelt':1})


In [10]:
twoDF.head(3)

  Species  Weight  Length  Fcode
0   Bream   242.0    25.4      0
1   Bream   290.0    26.3      0
2   Bream   340.0    26.5      0
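
The row filter and label encoding above can also be written with `isin()` and `map()` on an explicit copy, which avoids assigning into a slice of the original frame. The following is a minimal sketch, assuming a small hand-made frame; `demoDF`, `demo_two`, and the row values are illustrative and are not the contents of fish.csv.

```python
import pandas as pd

# Hypothetical miniature of the fish data; values are for illustration only.
demoDF = pd.DataFrame({
    'Species': ['Bream', 'Roach', 'Smelt', 'Bream'],
    'Weight': [242.0, 40.0, 6.7, 290.0],
    'Length': [25.4, 14.1, 9.8, 26.3],
})

# Keep only the two species of interest; .copy() gives an independent frame.
mask = demoDF['Species'].isin(['Bream', 'Smelt'])
demo_two = demoDF[mask].copy().reset_index(drop=True)

# Encode the labels: Bream -> 0, Smelt -> 1.
demo_two['Fcode'] = demo_two['Species'].map({'Bream': 0, 'Smelt': 1})
print(demo_two)
```

`map()` leaves any species missing from the dictionary as NaN, which makes an unexpected label easy to spot.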


[2] Separate features and target

In [11]:
features=twoDF[['Weight','Length']]
target=twoDF['Fcode']

In [12]:
print(f'features => {features.shape},{features.ndim}D')
print(f'target => {target.shape},{target.ndim}D')

features => (49, 2),2D
target => (49,),1D


[3] Prepare the training and test datasets

In [13]:
from sklearn.model_selection import train_test_split

In [14]:
# train:test = 80:20 ===> test_size=0.2 or train_size=0.8
# stratify parameter: used for classification; keeps the class proportions of the target in both splits
X_train,X_test,y_train,y_test = train_test_split(features,target,
                                                 test_size=0.2,
                                                 stratify=target,
                                                 random_state=10)

In [15]:
print(f'X_train : {X_train.shape}, {X_train.ndim}D')
print(f'y_train : {y_train.shape}, {y_train.ndim}D')
print(f'X_test : {X_test.shape}, {X_test.ndim}D')
print(f'y_test : {y_test.shape}, {y_test.ndim}D')

X_train : (39, 2), 2D
y_train : (39,), 1D
X_test : (10, 2), 2D
y_test : (10,), 1D
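
To see what `stratify` does in isolation, the sketch below splits synthetic labels with 35 samples of class 0 and 14 of class 1, the class counts implied by the ratios reported in this notebook. `X_demo` and `y_demo` are placeholder names and the feature column is a dummy; only the label counts matter here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 35 of class 0 and 14 of class 1 (49 samples in total).
y_demo = np.array([0] * 35 + [1] * 14)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=10
)

# Count how many samples of each class landed in each split.
train_counts = np.bincount(y_tr)
test_counts = np.bincount(y_te)
print(train_counts, test_counts)
```

With stratification the per-class counts in each split are fixed by the class proportions; `random_state` only changes which individual samples are chosen.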


In [16]:
# Ratio of target classes 0 (Bream) and 1 (Smelt); first, a quick look at the labels
y_train.head()

7     0
43    1
1     0
46    1
31    0
Name: Fcode, dtype: int64

In [17]:
y_train.value_counts()[0]/y_train.shape[0],y_train.value_counts()[1]/y_train.shape[0]

(0.717948717948718, 0.28205128205128205)

In [18]:
y_test.value_counts()[0]/y_test.shape[0],y_test.value_counts()[1]/y_test.shape[0]

(0.7, 0.3)
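
The same ratios can be read off in one call with `value_counts(normalize=True)`. This is a small sketch using a stand-in series (`y_demo_train` is a hypothetical name) with 28 labels of class 0 and 11 of class 1, which reproduces the training ratios printed above.

```python
import pandas as pd

# Stand-in for y_train: 28 Bream (0) and 11 Smelt (1).
y_demo_train = pd.Series([0] * 28 + [1] * 11, name='Fcode')

# normalize=True returns proportions instead of raw counts.
ratios = y_demo_train.value_counts(normalize=True)
print(ratios)
```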

[3-2] Feature scaling

In [19]:
from sklearn.preprocessing import MinMaxScaler

In [20]:
# Create the scaler instance
mmScaler = MinMaxScaler()

In [21]:
# Fit the scaler on the training data so it learns each feature's min and max
mmScaler.fit(X_train)

In [38]:
mmScaler.min_,mmScaler.data_min_,mmScaler.scale_, mmScaler.data_max_

(array([-0.00674519, -0.31410256]),
 array([6.7, 9.8]),
 array([0.00100675, 0.03205128]),
 array([1000.,   41.]))
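
The four attributes are related: with the default `feature_range=(0, 1)`, `scale_` is `1 / (data_max_ - data_min_)` and `min_` is `-data_min_ * scale_`. The sketch below recomputes them in plain Python from the `data_min_` and `data_max_` values printed above; the variable names `scale` and `offset` are illustrative.

```python
# Per-feature minimum and maximum of the training data, copied from the output above.
data_min = [6.7, 9.8]        # Weight, Length
data_max = [1000.0, 41.0]

# For the default feature_range=(0, 1):
#   scale_ = 1 / (data_max_ - data_min_)
#   min_   = -data_min_ * scale_
scale = [1.0 / (hi - lo) for lo, hi in zip(data_min, data_max)]
offset = [-lo * s for lo, s in zip(data_min, scale)]

print(scale)
print(offset)
```

The transform itself is then `x * scale_ + min_`, which is the same as `(x - data_min_) / (data_max_ - data_min_)`.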

In [42]:
# Training dataset ==> scaling ==> returns an ndarray
X_train_scaled=mmScaler.transform(X_train)
X_train_scaled.shape

(39, 2)

In [43]:
# Test dataset ==> scaling with the scaler fitted on the training data ==> returns an ndarray
X_test_scaled=mmScaler.transform(X_test)
X_test_scaled.shape, X_test_scaled.min(), X_test_scaled.max()

((10, 2), 0.0033222591362126247, 0.8489882210812445)

[4] Training
- Create an instance of the learning algorithm
- Train it => fit()

In [44]:
from sklearn.neighbors import KNeighborsClassifier

In [45]:
# Create the model instance
model=KNeighborsClassifier()

In [60]:
# Train the model => training dataset
model.fit(X_train_scaled,y_train)

In [62]:
# Attributes learned during fit
model.classes_, model.n_samples_fit_
# model.feature_names_in_ is not set when the model is fitted on an ndarray (no column names)

(array([0, 1], dtype=int64), 39)

[5] Model evaluation ==> score() method + test dataset

In [63]:
model.score(X_test_scaled,y_test)

1.0
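
For a classifier, `score()` reports mean accuracy: the fraction of samples whose predicted label equals the true label. A minimal plain-Python sketch of that quantity follows; the helper `accuracy` and both label lists are hypothetical and are not taken from this notebook's test set.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)

# Hypothetical labels: 7 Bream (0) and 3 Smelt (1), with one wrong prediction.
y_true_demo = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_pred_demo = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

print(accuracy(y_true_demo, y_pred_demo))
```

A score of 1.0 on ten test samples means every one was classified correctly; with so few samples, a single error would already move the score by 0.1.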

[6] Prediction ==> use data that was not used for training or testing
- Note: the input data must be 2D

In [65]:
new_data=pd.DataFrame([[413,27.8]],columns=['Weight','Length'])
new_data

   Weight  Length
0     413    27.8


In [67]:
new_data_scaled=mmScaler.transform(new_data)
new_data_scaled

array([[0.40904057, 0.57692308]])
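
As a sanity check, the scaled values can be reproduced by hand with `(x - data_min_) / (data_max_ - data_min_)`, using the training minimum and maximum that the scaler reported earlier. The names below are illustrative.

```python
# Training-set statistics reported by the fitted scaler (Weight, Length).
data_min = [6.7, 9.8]
data_max = [1000.0, 41.0]

# The new sample from the cell above.
new_point = [413, 27.8]

# Min-max scaling applied feature by feature.
manual_scaled = [
    (x - lo) / (hi - lo) for x, lo, hi in zip(new_point, data_min, data_max)
]
print(manual_scaled)
```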

In [68]:
# Predict the class of the new sample
model.predict(new_data_scaled)

array([0], dtype=int64)

In [69]:
### Find the k nearest training samples
distance, index=model.kneighbors(new_data_scaled)

In [70]:
distance

array([[0.04209753, 0.06334927, 0.07138647, 0.07421737, 0.07974703]])

In [71]:
index

array([[25, 22, 21,  0,  6]], dtype=int64)

In [72]:
neighbors=index.reshape(-1).tolist()
neighbors

[25, 22, 21, 0, 6]
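
The indices returned by `kneighbors()` are row positions in the array passed to `fit()`, not labels of the shuffled `y_train` index, so the neighbors' classes should be looked up with `.iloc`. The sketch below uses the five labels shown earlier by `y_train.head()`; the neighbor positions `[0, 1, 2]` are hypothetical and chosen only to illustrate the lookup and the majority vote used with the default uniform weights.

```python
import pandas as pd

# First five training labels, with the shuffled index shown by y_train.head().
y_demo = pd.Series([0, 1, 0, 1, 0], index=[7, 43, 1, 46, 31], name='Fcode')

# Hypothetical neighbor positions, in the same form as the kneighbors() output.
neighbor_pos = [0, 1, 2]

# Positional lookup: .iloc uses row positions, .loc would use index labels.
neighbor_labels = y_demo.iloc[neighbor_pos]

# Majority vote among the neighbors.
predicted = neighbor_labels.mode().iloc[0]
print(neighbor_labels.tolist(), predicted)
```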

In [77]:
X_train_scaled[neighbors]

array([[0.42615524, 0.61538462],
       [0.35870331, 0.61538462],
       [0.44629014, 0.63782051],
       [0.38588543, 0.6474359 ],
       [0.44629014, 0.6474359 ]])

In [78]:
X_train_scaled[neighbors][0]

array([0.42615524, 0.61538462])
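
With the default settings, `KNeighborsClassifier` measures Euclidean distance in the scaled feature space. The first distance reported by `kneighbors()` can be checked by hand from the scaled new sample and this nearest neighbor; the variable names are illustrative.

```python
import math

new_point = (0.40904057, 0.57692308)   # scaled new sample, from the output above
nearest = (0.42615524, 0.61538462)     # X_train_scaled[25], the first neighbor

# Euclidean distance between the two points.
d = math.dist(new_point, nearest)
print(round(d, 8))
```

The result agrees with the first entry of `distance` up to the rounding of the printed coordinates.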

In [79]:
k_weight=X_train_scaled[neighbors][:,0]
k_length=X_train_scaled[neighbors][:,1]

print(k_weight,k_length,sep='\n')

[0.42615524 0.35870331 0.44629014 0.38588543 0.44629014]
[0.61538462 0.61538462 0.63782051 0.6474359  0.6474359 ]
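
A natural next step is to plot the five neighbors together with the new sample. The following is a sketch only, assuming the scaled coordinates printed in the outputs above; the marker, color, and label choices are arbitrary.

```python
import matplotlib.pyplot as plt

# Scaled coordinates copied from the outputs above.
k_weight = [0.42615524, 0.35870331, 0.44629014, 0.38588543, 0.44629014]
k_length = [0.61538462, 0.61538462, 0.63782051, 0.6474359, 0.6474359]
new_weight, new_length = 0.40904057, 0.57692308

fig, ax = plt.subplots()
# The five nearest training samples.
ax.scatter(k_weight, k_length, label='5 nearest neighbors')
# The new sample being classified.
ax.scatter([new_weight], [new_length], marker='^', color='red', label='new data')
ax.set_xlabel('Weight (scaled)')
ax.set_ylabel('Length (scaled)')
ax.legend()
```

In a notebook the figure is displayed when the cell finishes; in a script, call `plt.show()` afterwards.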
