### Given the dataset glass.data.txt
### Use AdaBoost/XGBoost with a base learner to predict the glass type from the provided attributes
1. Read the data and assign it to the variable data. Inspect data: shape, type, head(), tail(), info. Preprocess the data (if needed)
2. Create the inputs data from every column except the type-of-glass class column, and the outputs data from the single type-of-glass class column
3. From the inputs and outputs data, create X_train, X_test, y_train, y_test with a 70-30 split
4. Fit AdaBoost/XGBoost on X_train, y_train
5. Predict y from X_test and compare it with y_test
6. Evaluate the model and comment on the results
7. Save the model (if it performs well after evaluation)

### Attribute Information:
1. Id number: 1 to 214
2. RI: refractive index
3. Na: Sodium (unit measurement: weight percent in corresponding oxide, as are attributes 4-10)
4. Mg: Magnesium
5. Al: Aluminum
6. Si: Silicon
7. K: Potassium
8. Ca: Calcium
9. Ba: Barium
10. Fe: Iron
11. Type of glass (class attribute):
    - 1: building_windows_float_processed
    - 2: building_windows_non_float_processed
    - 3: vehicle_windows_float_processed
    - 4: vehicle_windows_non_float_processed (none in this database)
    - 5: containers
    - 6: tableware
    - 7: headlamps
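
Because the file has no header row, the columns are labeled 0 to 10 once it is loaded. The sketch below shows one way to attach the attribute names listed above. The short names `Id` and `Type`, the `COLUMN_NAMES` constant, and the two-row `sample` frame (the first two records of the dataset) are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Names follow the attribute list; "Id" and "Type" are shorthand chosen for this sketch.
COLUMN_NAMES = ["Id", "RI", "Na", "Mg", "Al", "Si", "K", "Ca", "Ba", "Fe", "Type"]

# Two records standing in for the full 214-row file.
sample = pd.DataFrame([
    [1, 1.52101, 13.64, 4.49, 1.10, 71.78, 0.06, 8.75, 0.0, 0.0, 1],
    [2, 1.51761, 13.89, 3.60, 1.36, 72.73, 0.48, 7.83, 0.0, 0.0, 1],
])
sample.columns = COLUMN_NAMES

# Features: everything except the row identifier and the class label.
features = sample.drop(columns=["Id", "Type"])
target = sample["Type"]
print(features.columns.tolist())
```

Named columns make later selections such as `drop(columns=[...])` easier to read than positional slicing, at the cost of one extra line after loading.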

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

In [2]:
data = pd.read_csv("../../Data/glass.data.txt", sep=",", header=None)
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 214 entries, 0 to 213
Data columns (total 11 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       214 non-null    int64  
 1   1       214 non-null    float64
 2   2       214 non-null    float64
 3   3       214 non-null    float64
 4   4       214 non-null    float64
 5   5       214 non-null    float64
 6   6       214 non-null    float64
 7   7       214 non-null    float64
 8   8       214 non-null    float64
 9   9       214 non-null    float64
 10  10      214 non-null    int64  
dtypes: float64(9), int64(2)
memory usage: 18.5 KB


In [3]:
data.shape

(214, 11)

In [4]:
data.head()

   0        1      2     3     4      5     6     7    8    9  10
0  1  1.52101  13.64  4.49  1.10  71.78  0.06  8.75  0.0  0.0   1
1  2  1.51761  13.89  3.60  1.36  72.73  0.48  7.83  0.0  0.0   1
2  3  1.51618  13.53  3.55  1.54  72.99  0.39  7.78  0.0  0.0   1
3  4  1.51766  13.21  3.69  1.29  72.61  0.57  8.22  0.0  0.0   1
4  5  1.51742  13.27  3.62  1.24  73.08  0.55  8.07  0.0  0.0   1


In [5]:
data.groupby(10).count()[0]

10
1    70
2    76
3    17
5    13
6     9
7    29
Name: 0, dtype: int64

In [6]:
# The columns that we will be making predictions with.
inputs = data.iloc[:,1:-1]
inputs.shape

(214, 9)

In [7]:
inputs.head()

         1      2     3     4      5     6     7    8    9
0  1.52101  13.64  4.49  1.10  71.78  0.06  8.75  0.0  0.0
1  1.51761  13.89  3.60  1.36  72.73  0.48  7.83  0.0  0.0
2  1.51618  13.53  3.55  1.54  72.99  0.39  7.78  0.0  0.0
3  1.51766  13.21  3.69  1.29  72.61  0.57  8.22  0.0  0.0
4  1.51742  13.27  3.62  1.24  73.08  0.55  8.07  0.0  0.0


In [8]:
# The column that we want to predict.
outputs = data[10]
outputs = np.array(outputs)
outputs.shape

(214,)

In [9]:
X_train, X_test, y_train, y_test = train_test_split(inputs, outputs, test_size=0.30, random_state=1)
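
The class counts are uneven (the smallest class has 9 samples), so a purely random split can leave a class under-represented in the test set. A hedged sketch of a stratified split follows. It uses placeholder features and synthetic labels that reuse the class counts printed earlier, and `X_demo`, `y_demo`, and `class_counts` are names introduced only for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the glass data (assumption for the demo).
class_counts = {1: 70, 2: 76, 3: 17, 5: 13, 6: 9, 7: 29}
y_demo = np.concatenate([np.full(n, label) for label, n in class_counts.items()])
X_demo = np.random.default_rng(0).normal(size=(len(y_demo), 9))  # placeholder features

# stratify=y_demo keeps the class proportions roughly equal in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=1, stratify=y_demo
)

test_labels, test_counts = np.unique(y_te, return_counts=True)
print(dict(zip(test_labels.tolist(), test_counts.tolist())))
```

Whether stratification changes the scores reported in this notebook is not something the sketch can show; it only illustrates the option.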

We do not combine AdaBoostClassifier with KNN here because KNeighborsClassifier.fit does not accept sample_weight, which AdaBoostClassifier requires from its base learner.
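
One way to check this requirement before choosing a base learner is scikit-learn's `has_fit_parameter` helper, which inspects the signature of an estimator's `fit` method. The variable names `knn_ok` and `tree_ok` are illustrative.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils.validation import has_fit_parameter

# True only if the estimator's fit() signature has a sample_weight parameter.
knn_ok = has_fit_parameter(KNeighborsClassifier(), "sample_weight")
tree_ok = has_fit_parameter(DecisionTreeClassifier(), "sample_weight")

print("KNN supports sample_weight:", knn_ok)
print("Decision tree supports sample_weight:", tree_ok)
```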

### AdaBoost

In [10]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

In [11]:
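# AdaBoostClassifier defaults to a depth-1 tree, DecisionTreeClassifier(max_depth=1);
# an unrestricted DecisionTreeClassifier() is passed here explicitly instead.
# Note: scikit-learn 1.2 renamed the base_estimator parameter to estimator.
ml = DecisionTreeClassifier()
boosting = AdaBoostClassifier(n_estimators=100, base_estimator=ml, learning_rate=1)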

In [12]:
# Train model
model_new = boosting.fit(X_train, y_train)

In [13]:
model_new.score(X_train, y_train)

1.0

In [14]:
model_new.score(X_test, y_test)

0.676923076923077

### Conclusion: Overfitting
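
A perfect training score next to a much lower test score is what an unrestricted tree tends to produce, since it can memorize the training set. The sketch below compares an unrestricted base tree with a shallow one. It runs on synthetic data from `make_classification` (an assumption, not the glass data), so its numbers say nothing about this dataset. The base learner is passed positionally because its keyword name differs between scikit-learn releases.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 214 rows, 9 features, 3 classes (all assumptions).
X_syn, y_syn = make_classification(
    n_samples=214, n_features=9, n_informative=5, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.30, random_state=1)

candidates = {
    "unrestricted": DecisionTreeClassifier(random_state=0),
    "max_depth=2": DecisionTreeClassifier(max_depth=2, random_state=0),
}

results = {}
for name, base in candidates.items():
    # First positional argument is the base learner in both old and new releases.
    model = AdaBoostClassifier(base, n_estimators=100, random_state=0)
    model.fit(X_tr, y_tr)
    results[name] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    print(name, "train/test accuracy:", results[name])
```

Limiting the depth of the base tree is only one possible lever; lowering `learning_rate` or `n_estimators` are others worth trying.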

In [15]:
from sklearn.ensemble import RandomForestClassifier

In [16]:
ml_1 = RandomForestClassifier(n_estimators=100)
boosting_1 = AdaBoostClassifier(n_estimators=100, base_estimator=ml_1, learning_rate=0.1)

In [17]:
# Train model
boosting_1.fit(X_train, y_train)

AdaBoostClassifier(base_estimator=RandomForestClassifier(), learning_rate=0.1,
                   n_estimators=100)

In [18]:
boosting_1.score(X_train, y_train)

1.0

In [19]:
boosting_1.score(X_test, y_test)

0.8461538461538461

In [20]:
from sklearn.model_selection import cross_val_score

In [21]:
scores1 = cross_val_score(boosting_1, inputs, outputs, cv=20)
scores1



array([0.54545455, 0.81818182, 0.72727273, 0.72727273, 0.90909091,
       0.81818182, 0.72727273, 0.81818182, 0.72727273, 0.63636364,
       0.54545455, 0.90909091, 0.90909091, 0.90909091, 0.4       ,
       0.5       , 1.        , 0.7       , 0.7       , 0.8       ])

In [22]:
display(np.mean(scores1),np.std(scores1))

0.7413636363636364

0.1528645216636976
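
With 214 rows and cv=20, each fold holds only 10 or 11 test samples, and the smallest class (9 samples) cannot appear in every fold, which may explain part of the spread in the fold scores. A hedged sketch with fewer, explicitly stratified and shuffled folds follows. The data is synthetic (random features, labels reusing the class counts shown earlier), and `X_demo`, `y_demo`, and `scores_demo` are illustrative names.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic labels with the glass class counts; the features are random placeholders.
class_counts = {1: 70, 2: 76, 3: 17, 5: 13, 6: 9, 7: 29}
y_demo = np.concatenate([np.full(n, label) for label, n in class_counts.items()])
X_demo = np.random.default_rng(0).normal(size=(len(y_demo), 9))

# Five folds: every class has at least five members, so each fold can contain all classes.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores_demo = cross_val_score(DecisionTreeClassifier(random_state=0), X_demo, y_demo, cv=cv)

print(scores_demo.mean(), scores_demo.std())
```

Because the features here are random noise, the printed accuracy is meaningless; only the fold setup carries over to the real data.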

#### Conclusion: Still overfitting, but improved
#### Is there a better model? Show the results.
#### Try applying XGBoost to this problem.

### XGBoost

In [24]:
import xgboost as xgb

In [25]:
xgb_model = xgb.XGBClassifier(random_state=42)
xgb_model.fit(X_train, y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=4, num_parallel_tree=1,
              objective='multi:softprob', random_state=42, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=None, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)
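
The glass labels are 1, 2, 3, 5, 6, 7, with no class 4. The XGBoost version used above accepted them, but recent XGBoost releases expect class labels to be consecutive integers starting at 0 and raise an error otherwise. A minimal sketch of remapping the labels with scikit-learn's `LabelEncoder` follows. The `y_raw` array is a small made-up sample, not the notebook's `y_train`.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Small illustrative label vector using the glass class ids.
y_raw = np.array([1, 2, 3, 5, 6, 7, 1, 2])

encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y_raw)            # maps sorted classes to 0..5
y_restored = encoder.inverse_transform(y_encoded)   # maps predictions back to glass ids

print(y_encoded.tolist())
print(encoder.classes_.tolist())
```

In practice the encoder would be fitted on the training labels, and `inverse_transform` applied to the model's predictions before they are reported.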

In [26]:
xgb_model.score(X_train, y_train)

1.0

In [27]:
xgb_model.score(X_test, y_test)

0.8461538461538461

In [28]:
from sklearn.model_selection import cross_val_score

In [29]:
scores2 = cross_val_score(xgb_model, inputs, outputs, cv=20)
scores2

array([0.72727273, 0.81818182, 0.81818182, 0.63636364, 0.81818182,
       0.81818182, 0.63636364, 0.81818182, 0.81818182, 0.81818182,
       0.63636364, 1.        , 0.81818182, 0.81818182, 0.4       ,
       0.4       , 1.        , 0.8       , 0.9       , 0.8       ])

In [30]:
display(np.mean(scores2),np.std(scores2))

0.7650000000000001

0.15396347640306027
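
Step 7 of the task asks for the model to be saved if it evaluates well. One common approach is Python's `pickle` module, sketched below on a small stand-in model. The file name `glass_model.pkl`, the temporary directory, and the toy data are assumptions made for the illustration; in the notebook the object to save would be whichever fitted model is judged good enough.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data and model standing in for the fitted notebook model.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 9))
y_demo = rng.integers(0, 3, size=30)
model_demo = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

# Hypothetical output path; a temporary directory keeps the demo self-contained.
model_path = os.path.join(tempfile.mkdtemp(), "glass_model.pkl")

with open(model_path, "wb") as f:
    pickle.dump(model_demo, f)

with open(model_path, "rb") as f:
    restored = pickle.load(f)

# The reloaded model should reproduce the original predictions.
same_predictions = bool(np.array_equal(model_demo.predict(X_demo), restored.predict(X_demo)))
print(same_predictions)
```

A pickled model should be loaded with the same library versions that produced it, and only from trusted sources, since unpickling can execute arbitrary code.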