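## [Key Points]
Use the linear regression model in Sklearn to train on various datasets. Make sure you understand the **data type** of what is fed into the model for training, and the meaning of each of the model's parameters.

## Assignment
Try other datasets from sklearn datasets (wine, boston, ...) to train your own linear regression model. (Note that `load_boston` was removed in scikit-learn 1.2.)

### HINT: Check the type of the label to determine whether the dataset's task is classification or regression, then train with the correct model!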
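One way to act on the hint is to look at the label array before choosing a model. The sketch below is only a heuristic: the helper `guess_task` and its `max_classes` threshold are assumptions made for illustration, not part of scikit-learn. It uses datasets bundled with scikit-learn so nothing has to be downloaded.

```python
import numpy as np
from sklearn import datasets


def guess_task(target, max_classes=20):
    """Heuristic (an assumption, not a scikit-learn rule):
    a small number of distinct, integer-valued labels suggests classification."""
    target = np.asarray(target)
    n_unique = np.unique(target).size
    is_integer_valued = bool(np.all(np.mod(target, 1) == 0))
    if is_integer_valued and n_unique <= max_classes:
        return "classification"
    return "regression"


loaders = [
    ("wine", datasets.load_wine),
    ("breast_cancer", datasets.load_breast_cancer),
    ("diabetes", datasets.load_diabetes),
]
for name, loader in loaders:
    target = loader().target
    # dtype and number of distinct values are the two things worth checking
    print(name, target.dtype, np.unique(target).size, guess_task(target))
```

A heuristic like this can be fooled (for example by integer-valued regression targets with few distinct values), so the dataset description in `DESCR` remains the more reliable source.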


# 1. Load the required modules

In [98]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, accuracy_score 

import warnings
warnings.filterwarnings('ignore')                   # suppress warning messages


# 2. Prediction with a linear regression model

## (1). Loading and preparing the data

In [28]:
# Load the dataset
dataset = datasets.fetch_california_housing()

# Convert the feature matrix (data) to a DataFrame
data = pd.DataFrame(dataset.data, columns=dataset.feature_names)  # dataset.data could be split directly; the DataFrame is just easier to inspect

# Set the target values
target = dataset.target

# Show the first rows of the DataFrame (target not included)
data.head()


|   | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 |
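The manual `pd.DataFrame(...)` step can usually be skipped: the scikit-learn loaders accept `as_frame=True` and then return pandas objects directly. The sketch below uses `load_diabetes` only because it ships with scikit-learn and needs no download; `fetch_california_housing` takes the same argument.

```python
from sklearn import datasets

# as_frame=True makes the loader return pandas objects instead of NumPy arrays
bunch = datasets.load_diabetes(as_frame=True)

features = bunch.data    # DataFrame with one column per feature
target = bunch.target    # Series holding the target values
full = bunch.frame       # features plus a trailing "target" column

print(type(features).__name__, features.shape)
print(type(target).__name__, target.shape)
print(full.columns[-1])
```

Keeping features and target in separate objects, as the cell above does, avoids accidentally training on the target column.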


## (2). Splitting the data, training the model, predicting

In [30]:
# Split the data: a single call keeps features and targets aligned row by row
train_X, test_X, train_Y, test_Y = train_test_split(data, target, test_size=0.1, random_state=4)

# Set up the linear regression model
lr = linear_model.LinearRegression()

# Train the model
lr.fit(train_X, train_Y)

# Predict on the test set
pred_Y = lr.predict(test_X)
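To get a feel for what the fitted parameters mean, it can help to train on data whose true coefficients are known. The following is a minimal sketch on synthetic, noise-free data (the data and the chosen coefficients are made up for illustration): `coef_` holds one weight per feature, `intercept_` is the constant term, and `fit_intercept=True` is the default.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data generated from y = 3*x0 - 2*x1 + 5 (no noise)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0

model = LinearRegression(fit_intercept=True)
model.fit(X, y)

print("coef_      :", model.coef_)        # one weight per feature
print("intercept_ :", model.intercept_)   # constant term
print("R^2        :", model.score(X, y))  # score() returns R^2 for regressors
```

On real data such as the housing set, the coefficients depend on the scale of each feature, so their magnitudes are not directly comparable unless the features are standardized first.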


## (3). Scoring the predictions

In [36]:
# Use MSE (mean squared error) to measure the prediction error; lower is better
print("Mean Square Error : ",mean_squared_error(test_Y, pred_Y))

Mean Square Error :  0.5150369595361362
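MSE is easy to confuse with MAE, so it may help to compute both by hand. The numbers below are hypothetical and only illustrate the formulas: MSE averages the squared errors, MAE averages the absolute errors, and RMSE is the square root of MSE, which brings the value back to the units of the target.

```python
import math

# Hypothetical true values and predictions, for illustration only
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

errors = [t - p for t, p in zip(y_true, y_pred)]

mse = sum(e ** 2 for e in errors) / len(errors)   # mean squared error
mae = sum(abs(e) for e in errors) / len(errors)   # mean absolute error
rmse = math.sqrt(mse)                             # root mean squared error

print("MSE :", mse)
print("MAE :", mae)
print("RMSE:", rmse)
```

Applied to the result above, an MSE of about 0.515 corresponds to an RMSE of roughly 0.72 in the units of the target.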



# 3. Prediction with a Logistic Regression classification model

## (1). Loading and preparing the data

In [100]:
# Load the dataset
dataset = datasets.load_breast_cancer()

# Convert the feature matrix (data) to a DataFrame
data = pd.DataFrame(dataset.data, columns=dataset.feature_names)    # dataset.data could be split directly; the DataFrame is just easier to inspect
print(data.shape)

# Set the target values
target = dataset.target

# Show the first rows of the data (target not included)
data.head()

(569, 30)


|   | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |


## (2). Splitting the data, training the model, predicting

In [102]:
# Split the data: a single call keeps features and labels aligned row by row
train_X, test_X, train_Y, test_Y = train_test_split(data, target, test_size=0.1, random_state=7)

# Set up the logistic regression model
logr = linear_model.LogisticRegression()

# Train the model
logr.fit(train_X, train_Y)

# Predict on the test set
pred_Y = logr.predict(test_X)
print(pred_Y)

[1 0 1 1 1 0 1 1 1 1 0 0 0 1 1 1 1 1 0 0 0 0 0 1 0 1 1 1 1 1 1 1 1 0 1 0 1
 1 1 1 1 1 1 1 0 1 1 1 1 0 1 0 0 1 0 0 1]
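The features in this dataset span very different ranges (compare `mean area` with `mean smoothness`), and logistic regression on unscaled data may stop before converging, a warning that `warnings.filterwarnings('ignore')` would hide. A possible variant, sketched here as an alternative rather than as the notebook's method, standardizes the features in a pipeline and also shows `predict_proba`, which returns class probabilities instead of hard labels. The `max_iter=1000` value is an arbitrary choice.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=7
)

# The scaler is fitted on the training data only, then applied to the test data
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

proba = clf.predict_proba(X_test)   # shape (n_samples, 2): P(class 0), P(class 1)
print(proba[:3])
print("Accuracy:", clf.score(X_test, y_test))
```

Wrapping the scaler in the pipeline keeps test-set statistics out of the training step, which is the main reason to prefer it over scaling the whole DataFrame up front.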


## (3). Scoring the predictions

In [103]:
# Use accuracy to judge the quality of the predictions
acc = accuracy_score(test_Y, pred_Y)
print( "Accuracy : " , acc)

Accuracy :  0.9298245614035088
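Accuracy is simply the fraction of correct predictions; the value above is consistent with 53 correct out of 57 test samples. It does not say which class the mistakes fall in, which a confusion matrix does. The sketch below uses small hypothetical label arrays to show how the two relate.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels and predictions, for illustration only
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
acc = accuracy_score(y_true, y_pred)

print(cm)
print("Accuracy:", acc)

# Accuracy equals the diagonal (correct predictions) divided by the total
print("From matrix:", cm.trace() / cm.sum())
```

For a medical dataset such as this one, the off-diagonal cells matter individually, since a missed malignant case and a false alarm have very different costs.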



# 4. Prediction with a multiclass Logistic Regression model

## (1). Loading and preparing the data

In [104]:
# Load the dataset
dataset = datasets.load_wine()
data = pd.DataFrame(dataset.data, columns=dataset.feature_names) # dataset.data could be split directly; the DataFrame is just easier to inspect

# Set the target values
target = dataset.target
print(dataset.target_names)

# Show the first rows of the data (target not included)
data.head()

['class_0' 'class_1' 'class_2']


|   | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.23 | 1.71 | 2.43 | 15.6 | 127.0 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065.0 |
| 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100.0 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050.0 |
| 2 | 13.16 | 2.36 | 2.67 | 18.6 | 101.0 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185.0 |
| 3 | 14.37 | 1.95 | 2.5 | 16.8 | 113.0 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480.0 |
| 4 | 13.24 | 2.59 | 2.87 | 21.0 | 118.0 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735.0 |
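With only 178 samples and three classes, a 10% test split is small, so it is worth checking how the classes are distributed. The sketch below counts samples per class and uses the `stratify` argument of `train_test_split` to keep class proportions similar in both splits. Stratifying is an optional variation and changes which rows land in the test set compared with a plain split.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)

# Number of samples in each of the three classes
print("class counts     :", np.bincount(y))

# stratify=y preserves the class proportions in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=6, stratify=y
)
print("test class counts:", np.bincount(y_test))
```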


## (2). Splitting the data, training the model, predicting

In [108]:
# Split the data: a single call keeps features and labels aligned row by row
train_X, test_X, train_Y, test_Y = train_test_split(data, target, test_size=0.1, random_state=6)

# Set up the logistic regression model
logr = linear_model.LogisticRegression(multi_class='ovr')   # one-vs-rest multiclass strategy (this argument is deprecated in newer scikit-learn releases)

# Train the model
logr.fit(train_X, train_Y)

# Predict on the test set
pred_Y = logr.predict( test_X )
pred_Y

array([0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 2, 1, 1, 0, 0, 0])
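Newer scikit-learn releases deprecate the `multi_class` argument of `LogisticRegression` and point to `OneVsRestClassifier` for the one-vs-rest strategy. The following is a sketch of that alternative, not the notebook's original cell: it wraps one binary logistic regression per class and adds feature scaling, so its predictions may differ from the output above. The `max_iter=1000` value is an arbitrary choice.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=6
)

# One binary "this class vs. all others" logistic regression per class
model = make_pipeline(
    StandardScaler(),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
model.fit(X_train, y_train)

ovr = model[-1]   # the fitted OneVsRestClassifier step
print("binary classifiers:", len(ovr.estimators_))
print("Accuracy:", model.score(X_test, y_test))
```

Each entry in `ovr.estimators_` is a fitted binary `LogisticRegression`, so its `coef_` can be inspected to see which features separate one wine class from the rest.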

## (3). Scoring the predictions

In [110]:
# Use accuracy to judge the quality of the predictions
acc = accuracy_score(test_Y, pred_Y)
print("Accuracy : " , acc)

Accuracy :  0.9444444444444444
