### Sử dụng thuật toán Decision Tree để dự đoán nhiệt độ (Temperature_c) dựa trên các thông tin được cung cấp.
1. Đọc dữ liệu và gán cho biến data. Xem thông tin data: shape, type, head(), tail(), info. Tiền xử lý dữ liệu (nếu cần)
2. Từ inputs data và outputs data => Tạo X_train, X_test, y_train, y_test với tỷ lệ 80:20
3. Thực hiện Decision Tree với X_train, y_train
4. Dự đoán y từ X_test => so sánh với y_test
5. Xem kết quả => Nhận xét model
6. Ghi model nếu model phù hợp

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import pickle

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import accuracy_score, mean_squared_error, mean_absolute_error

In [2]:
# import some data to play with
data = pd.read_csv("../../Data/weather.csv")
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 8 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Temperature_c         10000 non-null  float64
 1   Humidity              10000 non-null  float64
 2   Wind_Speed_kmh        10000 non-null  float64
 3   Wind_Bearing_degrees  10000 non-null  int64  
 4   Visibility_km         10000 non-null  float64
 5   Pressure_millibars    10000 non-null  float64
 6   Rain                  10000 non-null  int64  
 7   Description           10000 non-null  object 
dtypes: float64(5), int64(2), object(1)
memory usage: 625.1+ KB


In [3]:
# Kiểm tra dữ liệu null
print(data.isnull().sum())
# => Không có dữ liệu null

Temperature_c           0
Humidity                0
Wind_Speed_kmh          0
Wind_Bearing_degrees    0
Visibility_km           0
Pressure_millibars      0
Rain                    0
Description             0
dtype: int64


In [4]:
data = data.dropna()

In [5]:
data.head()

Unnamed: 0,Temperature_c,Humidity,Wind_Speed_kmh,Wind_Bearing_degrees,Visibility_km,Pressure_millibars,Rain,Description
0,-0.555556,0.92,11.27,130,8.05,1021.6,0,Cold
1,21.111111,0.73,20.93,330,16.1,1017.0,1,Warm
2,16.6,0.97,5.9731,193,14.9086,1013.99,1,Normal
3,1.6,0.82,3.22,300,16.1,1031.59,1,Cold
4,2.194444,0.6,10.8836,116,9.982,1020.88,1,Cold


In [6]:
data.tail()

Unnamed: 0,Temperature_c,Humidity,Wind_Speed_kmh,Wind_Bearing_degrees,Visibility_km,Pressure_millibars,Rain,Description
9995,10.022222,0.95,10.2396,20,4.0089,1007.41,1,Normal
9996,8.633333,0.64,11.0446,80,9.982,1031.33,1,Normal
9997,5.977778,0.93,11.0446,269,14.9086,1014.21,1,Normal
9998,9.788889,0.78,8.1788,231,7.8246,1005.02,1,Normal
9999,11.138889,0.79,14.2485,131,10.2557,1010.14,1,Normal


In [7]:
# The columns that we will be making predictions with.
inputs = data.drop(["Temperature_c"], axis=1)
inputs.shape

(10000, 7)

In [8]:
inputs.head()

Unnamed: 0,Humidity,Wind_Speed_kmh,Wind_Bearing_degrees,Visibility_km,Pressure_millibars,Rain,Description
0,0.92,11.27,130,8.05,1021.6,0,Cold
1,0.73,20.93,330,16.1,1017.0,1,Warm
2,0.97,5.9731,193,14.9086,1013.99,1,Normal
3,0.82,3.22,300,16.1,1031.59,1,Cold
4,0.6,10.8836,116,9.982,1020.88,1,Cold


In [9]:
inputs = pd.get_dummies(inputs)
inputs.head()

Unnamed: 0,Humidity,Wind_Speed_kmh,Wind_Bearing_degrees,Visibility_km,Pressure_millibars,Rain,Description_Cold,Description_Normal,Description_Warm
0,0.92,11.27,130,8.05,1021.6,0,1,0,0
1,0.73,20.93,330,16.1,1017.0,1,0,0,1
2,0.97,5.9731,193,14.9086,1013.99,1,0,1,0
3,0.82,3.22,300,16.1,1031.59,1,1,0,0
4,0.6,10.8836,116,9.982,1020.88,1,1,0,0


In [10]:
inputs.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 0 to 9999
Data columns (total 9 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Humidity              10000 non-null  float64
 1   Wind_Speed_kmh        10000 non-null  float64
 2   Wind_Bearing_degrees  10000 non-null  int64  
 3   Visibility_km         10000 non-null  float64
 4   Pressure_millibars    10000 non-null  float64
 5   Rain                  10000 non-null  int64  
 6   Description_Cold      10000 non-null  uint8  
 7   Description_Normal    10000 non-null  uint8  
 8   Description_Warm      10000 non-null  uint8  
dtypes: float64(4), int64(2), uint8(3)
memory usage: 576.2 KB


In [11]:
# The column that we want to predict.
outputs = data["Temperature_c"]
outputs = np.array(outputs)
outputs.shape

(10000,)

In [12]:
X_train, X_test, y_train, y_test = train_test_split(inputs, outputs, test_size=0.3, random_state=42)

In [13]:
# Create decision tree regressor object
model = DecisionTreeRegressor()
# Train model
model.fit(X_train, y_train)

DecisionTreeRegressor()

In [14]:
# Kiểm tra độ chính xác
print("The Train/ Score is: ", model.score(X_train,y_train)*100,"%")
print("The Test/ Score is: ", model.score(X_test,y_test)*100,"%")

The Train/ Score is:  100.0 %
The Test/ Score is:  78.2085758215364 %


In [15]:
# Tính MSE
y_pred = model.predict(X_test)
print('Mean Squared Error:', mean_squared_error(y_test, y_pred))
print('Mean Absolute Error:', mean_absolute_error(y_test, y_pred))

Mean Squared Error: 19.05843060745667
Mean Absolute Error: 3.26787222223


### Nhận xét:
* Training và Testing chênh nhau ~22% => có hiện tượng overfitting
* Mô hình trên cho R^2 khá ~ 0.78, cho thấy nó fit 78% dữ liệu
* MSE ~ 19 & MAE ~ 3.3 => mô hình chưa ổn lắm, cần tìm cách giải quyết overfitting