# note
> Machine learning notes, prep for the Big Data Analysis Engineer exam (빅분기)

- toc: true
- branch: master
- badges: false
- comments: true
- author: pinkocto
- categories: [python]

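## Machine Learning
- Training (Train) / validation (Test) $\to$ evaluation

**Feature engineering**: data preprocessing

 `-` Scaling: adjusts the scale of numeric data (Standard / MinMax / Robust)
 
 `-` Encoding: converts categorical (string) data to numbers (Label / One hot)
 
 `-` Cross Validation: trains and evaluates the model on several train/validation splits to estimate how well it generalizes
 
 `-` Model Hyperparameter Tuning: tunes the settings that control an algorithm's internal structure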
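
As an illustration of the encoding item in the list above, here is a minimal sketch on a hypothetical toy column (the `toy` DataFrame is made up for this example and is not data1.csv). It uses scikit-learn's `LabelEncoder` for label encoding and `pandas.get_dummies` for one-hot encoding.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({"diagnosis": ["M", "B", "B", "M"]})

# Label encoding: each category becomes an integer.
# Classes are sorted alphabetically, so B -> 0 and M -> 1.
encoder = LabelEncoder()
toy["label"] = encoder.fit_transform(toy["diagnosis"])

# One-hot encoding: one indicator column per category
one_hot = pd.get_dummies(toy["diagnosis"], prefix="diagnosis")

print(toy)
print(one_hot)
```

Label encoding keeps a single column, while one-hot encoding avoids implying an order between categories at the cost of extra columns.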

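**Supervised learning ($Y$)**

`-` $Y$ continuous (numeric): Regression / prediction

`-` $Y$ categorical (string): Classification


**Unsupervised learning**

 `-` Cluster analysis (Clustering)
 
 `-` Dimensionality reduction / association analysis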

## Practice

In [1]:
# Import the pandas library
import pandas as pd

In [5]:
# Load the data file
df1 = pd.read_csv('./data/Breast_Cancer/data1.csv')

In [6]:
# Check the data structure and dtypes
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 32 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       569 non-null    int64  
 1   diagnosis                569 non-null    object 
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave_points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14  perimeter_se             5

In [8]:
df1['diagnosis'] # Target

0      M
1      M
2      M
3      M
4      M
      ..
564    M
565    M
566    M
567    M
568    B
Name: diagnosis, Length: 569, dtype: object

In [10]:
# Convert the string labels to numbers (M -> 1, B -> 0)
df1['Target'] = df1['diagnosis'].replace('M',1).replace('B',0)

In [12]:
df1['diagnosis'].unique()

array(['M', 'B'], dtype=object)
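
The chained `replace` calls above can also be written as a single explicit mapping. The sketch below assumes a hypothetical toy Series standing in for `df1['diagnosis']`; with `map`, any label missing from the dictionary becomes `NaN`, which makes unexpected categories easy to spot.

```python
import pandas as pd

# Hypothetical toy labels standing in for df1['diagnosis']
diagnosis = pd.Series(["M", "M", "B", "M", "B"])

# Explicit mapping: malignant (M) -> 1, benign (B) -> 0
target = diagnosis.map({"M": 1, "B": 0})

print(target.tolist())
print(target.value_counts().to_dict())
```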

### 01. Logistic Regression

In [16]:
Y = df1['diagnosis']
X = df1[['radius_mean', 'perimeter_mean', 'area_mean']]

In [17]:
# scikit-learn: feature engineering + machine learning
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

In [18]:
# Split the data into training and test sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y,
                                                    test_size=0.3)
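
The split above is random, so the reports in this note change from run to run. Below is a minimal sketch of a reproducible, stratified split. It assumes scikit-learn's bundled `load_breast_cancer` data as a stand-in for data1.csv; that copy names the columns `mean radius`, `mean perimeter`, `mean area` and codes the target as 0 = malignant, 1 = benign, unlike the notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Stand-in for data1.csv (assumption: not the notebook's own file)
data = load_breast_cancer(as_frame=True)
X = data.data[["mean radius", "mean perimeter", "mean area"]]
Y = data.target

# random_state fixes the shuffle; stratify keeps the class ratio
# the same in the training and test sets
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, stratify=Y, random_state=0
)

print(X_train.shape, X_test.shape)
print(Y_train.mean(), Y_test.mean())  # share of class 1 in each split
```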

In [20]:
# Train the model
model = LogisticRegression()
model.fit(X_train, Y_train)

In [21]:
# Compute predictions
Y_test_pred = model.predict(X_test)
Y_test_pred

array(['B', 'B', 'M', 'B', 'B', 'B', 'B', 'M', 'M', 'M', 'B', 'B', 'B',
       'M', 'B', 'B', 'B', 'M', 'M', 'M', 'B', 'M', 'B', 'B', 'B', 'M',
       'B', 'M', 'B', 'B', 'M', 'B', 'B', 'B', 'M', 'M', 'B', 'M', 'M',
       'M', 'B', 'B', 'M', 'B', 'B', 'M', 'M', 'M', 'M', 'M', 'M', 'B',
       'B', 'M', 'B', 'M', 'M', 'M', 'M', 'B', 'B', 'M', 'B', 'B', 'B',
       'B', 'M', 'M', 'M', 'B', 'M', 'B', 'B', 'B', 'B', 'M', 'B', 'B',
       'B', 'B', 'B', 'B', 'B', 'B', 'M', 'M', 'B', 'M', 'M', 'B', 'B',
       'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'M', 'M', 'B',
       'B', 'M', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B',
       'B', 'M', 'B', 'B', 'M', 'B', 'B', 'M', 'B', 'M', 'B', 'B', 'M',
       'M', 'B', 'M', 'B', 'B', 'M', 'M', 'B', 'B', 'B', 'B', 'M', 'B',
       'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'M', 'M', 'M', 'M', 'M',
       'B', 'B', 'B', 'B', 'B', 'B', 'M', 'B', 'B', 'B', 'M', 'B', 'M',
       'B', 'B'], dtype=object)

In [23]:
# Classification metrics: accuracy / precision / recall / F1
print(classification_report(Y_test, Y_test_pred))

              precision    recall  f1-score   support

           B       0.89      0.90      0.90       110
           M       0.82      0.80      0.81        61

    accuracy                           0.87       171
   macro avg       0.85      0.85      0.85       171
weighted avg       0.87      0.87      0.87       171



- The scores for M and B should not differ much; if the gap is large, the model should be rebuilt.
- Use the macro avg row to judge how well the model performs overall.
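
To make the averaged rows concrete, here is a small sketch that recomputes the macro and weighted averages of precision from the rounded per-class numbers in the report above. Because the inputs are already rounded, the results can differ from the printed report in the last digit.

```python
# Per-class precision and support copied from the report above (rounded values)
precision = {"B": 0.89, "M": 0.82}
support = {"B": 110, "M": 61}

# macro avg: plain mean over classes, every class counts equally
macro = sum(precision.values()) / len(precision)

# weighted avg: mean weighted by the number of samples in each class
total = sum(support.values())
weighted = sum(precision[c] * support[c] for c in precision) / total

print(round(macro, 3), round(weighted, 3))
```

The macro average is not pulled toward the larger class, which is why it is a reasonable row to read when the classes are imbalanced.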

### 02. Feature Engineering

In [30]:
import pandas as pd

In [32]:
df1 = pd.read_csv('./data/Breast_Cancer/data1.csv')
df1.head(2)

| | id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave_points_mean | ... | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave_points_worst | symmetry_worst | fractal_dimension_worst |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842302 | M | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | 842517 | M | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |


In [33]:
print(df1.shape)

(569, 32)


In [34]:
print(df1.columns) # List the columns in the DataFrame

Index(['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean',
       'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave_points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave_points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave_points_worst',
       'symmetry_worst', 'fractal_dimension_worst'],
      dtype='object')


In [35]:
df1.info() # Print a summary of the data (info() prints by itself; wrapping it in print() would add a trailing None)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 32 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       569 non-null    int64  
 1   diagnosis                569 non-null    object 
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave_points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14  perimeter_se             5

#### Scaling
- Puts continuous features on a comparable scale
- Standard Scaler: mean 0, standard deviation 1
- MinMax Scaler: minimum 0, maximum 1
- Robust Scaler: median 0, IQR 1 (IQR = range from the 25th to the 75th percentile)
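
The three scalers listed above correspond to simple formulas. The sketch below applies them by hand with NumPy to a hypothetical toy array (the values are made up for illustration), so the stated properties can be checked directly.

```python
import numpy as np

# Hypothetical toy values, including one large outlier
x = np.array([1.0, 2.0, 3.0, 4.0, 10.0])

# Standard scaling: subtract the mean, divide by the population std (ddof=0)
standard = (x - x.mean()) / x.std()

# MinMax scaling: map the minimum to 0 and the maximum to 1
minmax = (x - x.min()) / (x.max() - x.min())

# Robust scaling: subtract the median, divide by the IQR (Q3 - Q1)
q1, median, q3 = np.percentile(x, [25, 50, 75])
robust = (x - median) / (q3 - q1)

print(standard.mean(), standard.std())
print(minmax.min(), minmax.max())
print(robust)
```

The outlier (10.0) shifts the mean and the maximum, so it affects the standard and MinMax results, while the median and IQR used by robust scaling are unaffected by it.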

In [36]:
# Import the StandardScaler class from scikit-learn's preprocessing module.
from sklearn.preprocessing import StandardScaler

In [38]:
df2= df1[['radius_mean', 'perimeter_mean','area_mean']]

In [39]:
print(df2.describe())

       radius_mean  perimeter_mean    area_mean
count   569.000000      569.000000   569.000000
mean     14.127292       91.969033   654.889104
std       3.524049       24.298981   351.914129
min       6.981000       43.790000   143.500000
25%      11.700000       75.170000   420.300000
50%      13.370000       86.240000   551.100000
75%      15.780000      104.100000   782.700000
max      28.110000      188.500000  2501.000000


In [40]:
model = StandardScaler()
model.fit(df2) # Learn each column's mean and standard deviation from the data

In [45]:
pd.DataFrame(model.transform(df2)).describe() # transform applies the learned scaling to the data

| | 0 | 1 | 2 |
|---|---|---|---|
| count | 569.0 | 569.0 | 569.0 |
| mean | -1.256562e-16 | -1.272171e-16 | -1.900452e-16 |
| std | 1.00088 | 1.00088 | 1.00088 |
| min | -2.029648 | -1.984504 | -1.454443 |
| 25% | -0.6893853 | -0.6919555 | -0.6671955 |
| 50% | -0.2150816 | -0.23598 | -0.2951869 |
| 75% | 0.4693926 | 0.4996769 | 0.3635073 |
| max | 3.971288 | 3.97613 | 5.250529 |
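
The `std` row above shows 1.00088 rather than exactly 1. This is because `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas' `describe()` reports the sample standard deviation (`ddof=1`). The sketch below reproduces the factor for n = 569 and shows the same effect on a hypothetical toy Series.

```python
import numpy as np
import pandas as pd

n = 569  # number of rows reported by df1.info()

# Ratio between the sample std (ddof=1) and the population std (ddof=0)
print(np.sqrt(n / (n - 1)))

# Same effect on hypothetical toy data
x = pd.Series([1.0, 2.0, 3.0, 4.0, 10.0])
z = (x - x.mean()) / x.std(ddof=0)   # what StandardScaler computes
print(z.std(ddof=0), z.std())        # pandas' default std() uses ddof=1
```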


**Example 1. Load data1.csv and scale radius_mean with MinMaxScaler. (Check how the minimum and maximum change.)**

In [52]:
print(df1['radius_mean'].mean())

14.127291739894563


In [51]:
print(df1['radius_mean'].max())

28.11


In [53]:
print(df1['radius_mean'].min())

6.981


In [54]:
from sklearn.preprocessing import MinMaxScaler

In [56]:
model = MinMaxScaler() # Create the scaler
model.fit(df1[['radius_mean']]) # Learn the minimum and maximum of radius_mean

In [59]:
df1['Mean Radius(scaler)'] = model.transform(df1[['radius_mean']]) # Apply the transformation

In [64]:
print(df1['Mean Radius(scaler)'].min())

0.0


In [65]:
print(df1['Mean Radius(scaler)'].max())

1.0


- Confirmed that the values were rescaled so that the minimum is 0 and the maximum is 1.

**Example 2. Scale the Mean Area (`area_mean`) values from data1.csv with RobustScaler, then compute the maximum and the median.**

In [66]:
from sklearn.preprocessing import RobustScaler

In [69]:
print(df1['area_mean'].median())

551.1


In [67]:
model = RobustScaler()
model.fit(df1[['area_mean']])

In [72]:
df1['Mean Area(scale)'] = model.transform(df1[['area_mean']])

In [73]:
df1['Mean Area(scale)'].median()

0.0
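
The exercise also asks for the maximum, which the cells above do not print. As a rough check, it can be worked out by hand from the `area_mean` statistics in the earlier `describe()` output. This sketch assumes `RobustScaler` is used with its default 25%/75% quantile range and that its quantiles match those reported by `describe()`.

```python
# Summary statistics of area_mean copied from the describe() output earlier in this note
q1, median, q3, maximum = 420.3, 551.1, 782.7, 2501.0

iqr = q3 - q1  # interquartile range

# Robust scaling: (x - median) / IQR
scaled_median = (median - median) / iqr
scaled_max = (maximum - median) / iqr

print(scaled_median, round(scaled_max, 4))
```

On the actual data, `df1['Mean Area(scale)'].max()` should give a comparable value.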

**Example 3. Build a classification model that predicts diagnosis from data1.csv. Apply MinMax scaling to the explanatory variables before modeling, then compute the accuracy on the test set.**

- Scaling + Modeling : Pipeline

In [74]:
# Split the data into training and test sets
from sklearn.model_selection import train_test_split
# Classification algorithm used for training
from sklearn.tree import DecisionTreeClassifier
# Chain scaling and the model into a single estimator
from sklearn.pipeline import Pipeline
# Scaler
from sklearn.preprocessing import MinMaxScaler
# Evaluate the classification model
from sklearn.metrics import classification_report

- Set X / Y
- Build the Train / Test sets
- Train the model
- Evaluate

In [75]:
df1 = pd.read_csv('./data/Breast_Cancer/data1.csv')
df1.head(2)

| | id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave_points_mean | ... | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave_points_worst | symmetry_worst | fractal_dimension_worst |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842302 | M | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | 842517 | M | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |


In [76]:
df1['Target'] = df1['diagnosis'].replace('M',1).replace('B',0)

In [77]:
Y = df1['Target']
X = df1[['radius_mean', 'perimeter_mean','area_mean']]

In [78]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3)
print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)

(398, 3)
(171, 3)
(398,)
(171,)


In [79]:
model = Pipeline( [('scaler', MinMaxScaler()),
                   ('model', DecisionTreeClassifier())] )

In [80]:
model.fit(X_train, Y_train)

In [81]:
Y_train_pred = model.predict(X_train)
Y_test_pred = model.predict(X_test)

In [82]:
print(classification_report(Y_train, Y_train_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00       246
           1       1.00      1.00      1.00       152

    accuracy                           1.00       398
   macro avg       1.00      1.00      1.00       398
weighted avg       1.00      1.00      1.00       398



In [83]:
print(classification_report(Y_test, Y_test_pred))

              precision    recall  f1-score   support

           0       0.92      0.89      0.90       111
           1       0.81      0.85      0.83        60

    accuracy                           0.88       171
   macro avg       0.86      0.87      0.87       171
weighted avg       0.88      0.88      0.88       171



- A Pipeline can also be passed to a search such as GridSearchCV, so hyperparameter tuning covers the scaler and the model together.
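
One way to tune a pipeline is scikit-learn's `GridSearchCV`, where each parameter is addressed as `<step name>__<parameter>`. The sketch below is only an illustration: it assumes the bundled `load_breast_cancer` data as a stand-in for data1.csv (target coded 0 = malignant, 1 = benign), and the parameter grid is an arbitrary choice, not a recommendation.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Stand-in for data1.csv (assumption: not the notebook's own file)
data = load_breast_cancer(as_frame=True)
X = data.data[["mean radius", "mean perimeter", "mean area"]]
Y = data.target

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, stratify=Y, random_state=0
)

pipe = Pipeline([("scaler", MinMaxScaler()),
                 ("model", DecisionTreeClassifier(random_state=0))])

# Keys follow the "<step name>__<parameter>" convention
param_grid = {"model__max_depth": [2, 3, 4, 5],
              "model__min_samples_leaf": [1, 5, 10]}

# 5-fold cross-validation on the training set for every combination
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, Y_train)

print(search.best_params_)
print(search.score(X_test, Y_test))  # accuracy of the refit best pipeline
```

Because the scaler sits inside the pipeline, it is refit on each training fold, so the validation folds do not leak into the scaling step.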

**Logistic Regression Modeling**

In [85]:
from sklearn.linear_model import LogisticRegression

In [86]:
model = Pipeline( [('scaler', MinMaxScaler()), 
                   ('model', LogisticRegression())] )

In [87]:
model.fit(X_train, Y_train)

In [88]:
Y_train_pred = model.predict(X_train)
Y_test_pred = model.predict(X_test)

In [91]:
print(classification_report(Y_train, Y_train_pred))

              precision    recall  f1-score   support

           0       0.85      0.97      0.91       246
           1       0.94      0.73      0.82       152

    accuracy                           0.88       398
   macro avg       0.90      0.85      0.87       398
weighted avg       0.89      0.88      0.88       398



In [92]:
print(classification_report(Y_test, Y_test_pred))

              precision    recall  f1-score   support

           0       0.87      0.99      0.92       111
           1       0.98      0.72      0.83        60

    accuracy                           0.89       171
   macro avg       0.92      0.85      0.88       171
weighted avg       0.91      0.89      0.89       171



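- Compared with the DecisionTree, Logistic Regression seems to generalize better: the tree reaches 1.00 accuracy on the training set but 0.88 on the test set, while Logistic Regression scores 0.88 on the training set and 0.89 on the test set.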
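
A single random split gives only one estimate, so a comparison like the one above can be checked with cross-validation. The sketch below scores both pipelines with 5-fold `cross_val_score`; it again assumes the bundled `load_breast_cancer` data as a stand-in for data1.csv, so the numbers will not match the reports in this note exactly.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Stand-in for data1.csv (assumption: not the notebook's own file)
data = load_breast_cancer(as_frame=True)
X = data.data[["mean radius", "mean perimeter", "mean area"]]
Y = data.target

candidates = {
    "tree": DecisionTreeClassifier(random_state=0),
    "logistic": LogisticRegression(),
}

scores = {}
for name, estimator in candidates.items():
    pipe = Pipeline([("scaler", MinMaxScaler()), ("model", estimator)])
    # Accuracy on each of the 5 held-out folds
    scores[name] = cross_val_score(pipe, X, Y, cv=5, scoring="accuracy")
    print(name, scores[name].mean().round(3), scores[name].std().round(3))
```

A higher mean with a smaller spread across folds is the kind of evidence that would support the generalization claim.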