## Lab 17. Developing a machine learning model for a selected subject area.

Dataset used: [Car Evaluation](https://archive.ics.uci.edu/ml/datasets/Car+Evaluation)

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import mean_squared_error, mean_absolute_error, confusion_matrix
import os
import requests

%matplotlib inline
pd.options.display.max_columns = None

Download and prepare the data.

In [2]:
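def downloadFile(url, filePath):
    # Download only once; create the target directory if it is missing
    if not os.path.exists(filePath):
        os.makedirs(os.path.dirname(filePath) or ".", exist_ok=True)
        req = requests.get(url)
        req.raise_for_status()  # fail loudly instead of saving an error page
        # The context manager guarantees the file is flushed and closed
        with open(filePath, "wb") as f:
            f.write(req.content)

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/car"
downloadFile(url + "/car.data", "dataset/car.data")
downloadFile(url + "/car.names", "dataset/car.names")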

All features in this dataset are categorical.

In [3]:
headers = ["buying", "maint", "doors", "persons", "lug_boot", "safety", "class"]
data = pd.read_csv("dataset/car.data", names=headers)
# Every column is categorical, so cast the whole frame at once
data = data.astype("category")

display(data.dtypes)
display(data.isna().sum())

buying      category
maint       category
doors       category
persons     category
lug_boot    category
safety      category
class       category
dtype: object

buying      0
maint       0
doors       0
persons     0
lug_boot    0
safety      0
class       0
dtype: int64

The data contains no missing values.

In [4]:
data.describe()

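| | buying | maint | doors | persons | lug_boot | safety | class |
|---|---|---|---|---|---|---|---|
| count | 1728 | 1728 | 1728 | 1728 | 1728 | 1728 | 1728 |
| unique | 4 | 4 | 4 | 3 | 3 | 3 | 4 |
| top | vhigh | vhigh | 5more | more | small | med | unacc |
| freq | 432 | 432 | 432 | 576 | 576 | 576 | 1210 |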
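The summary above shows that the most frequent class, `unacc`, covers 1210 of the 1728 rows. As a rough reference point, the short calculation below derives the accuracy that a trivial "always predict the majority class" model would reach. The variable names are illustrative only; the two counts are taken from the `describe()` output.

```python
# Counts copied from the describe() output above
n_rows = 1728          # total number of rows
top_class_freq = 1210  # rows labelled with the most frequent class, "unacc"

# Accuracy of a model that always predicts the majority class
baseline_accuracy = top_class_freq / n_rows
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Any classifier trained later should be judged against this baseline rather than against zero.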


In [5]:
data.sample(20)

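| | buying | maint | doors | persons | lug_boot | safety | class |
|---|---|---|---|---|---|---|---|
| 1665 | low | low | 3 | more | small | low | unacc |
| 1280 | med | low | 5more | 4 | small | high | good |
| 1338 | low | vhigh | 3 | 4 | big | low | unacc |
| 1506 | low | high | 5more | more | med | low | unacc |
| 1077 | med | high | 5more | more | big | low | unacc |
| 1587 | low | med | 4 | more | med | low | unacc |
| 13 | vhigh | vhigh | 2 | 4 | med | med | unacc |
| 471 | high | vhigh | 3 | 4 | med | low | unacc |
| 914 | med | vhigh | 3 | more | med | high | acc |
| 357 | vhigh | low | 3 | 2 | big | low | unacc |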


In [6]:
data.duplicated().any()

False

There are no duplicate rows.

Convert the categorical features to numeric codes so that the classifier can use them.

In [7]:
le = LabelEncoder()
for col in data.columns:
    data[col] = le.fit_transform(data[col])

data.sample(10)


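| | buying | maint | doors | persons | lug_boot | safety | class |
|---|---|---|---|---|---|---|---|
| 165 | 3 | 0 | 2 | 0 | 1 | 1 | 2 |
| 1394 | 1 | 3 | 3 | 1 | 0 | 0 | 0 |
| 1019 | 2 | 0 | 1 | 2 | 2 | 0 | 0 |
| 1036 | 2 | 0 | 2 | 1 | 2 | 2 | 2 |
| 477 | 0 | 3 | 1 | 2 | 2 | 1 | 2 |
| 97 | 3 | 3 | 3 | 1 | 0 | 2 | 2 |
| 1060 | 2 | 0 | 3 | 0 | 0 | 2 | 2 |
| 1601 | 1 | 2 | 3 | 0 | 0 | 0 | 2 |
| 21 | 3 | 3 | 0 | 2 | 1 | 1 | 2 |
| 748 | 0 | 2 | 3 | 2 | 2 | 2 | 2 |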
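`LabelEncoder` assigns codes in sorted (alphabetical) label order, so a feature such as `buying` ends up coded as high=0, low=1, med=2, vhigh=3, which does not follow the natural low-to-high ordering. One possible alternative, shown here only as a sketch on a small hand-made frame, is `OrdinalEncoder` with an explicit category order. The frame `toy` and the order lists are assumptions made for illustration and are not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical toy frame reusing two feature names from the dataset
toy = pd.DataFrame({
    "buying": ["low", "med", "high", "vhigh"],
    "safety": ["low", "med", "high", "low"],
})

# LabelEncoder sorts labels alphabetically before assigning codes
alphabetical = list(LabelEncoder().fit(toy["buying"]).classes_)

# Assumed natural order for each column, lowest to highest
orders = [
    ["low", "med", "high", "vhigh"],  # buying
    ["low", "med", "high"],           # safety
]
encoder = OrdinalEncoder(categories=orders)
encoded = encoder.fit_transform(toy[["buying", "safety"]])

print(alphabetical)
print(encoded)
```

Because `OrdinalEncoder` keeps one mapping per column, `encoder.inverse_transform` can also recover the original labels, which a single reused `LabelEncoder` cannot do for all columns.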


Prepare the training and test sets.

In [8]:
X = data.drop(columns=["class"]).copy()
y = data["class"].copy()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=32)
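The split above is purely random. Since the classes in this dataset are imbalanced, a stratified split may be worth considering, because it keeps the class proportions equal in both parts. The sketch below demonstrates the `stratify` argument of `train_test_split` on synthetic labels; `X_toy` and `y_toy` are made-up arrays, not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, deliberately imbalanced labels: 70 / 20 / 10 samples per class
y_toy = np.array([0] * 70 + [1] * 20 + [2] * 10)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# stratify=y_toy preserves the 70/20/10 proportions in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=32, stratify=y_toy
)

train_counts = np.bincount(y_tr)
test_counts = np.bincount(y_te)
print("train:", train_counts, "test:", test_counts)
```

In the notebook itself the equivalent change would be adding `stratify=y` to the existing call; the reported scores would then need to be recomputed.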

Solve the classification problem with a neural network of the "[multilayer perceptron](https://en.wikipedia.org/wiki/Multilayer_perceptron)" type. Train the classifier and make predictions.

In [9]:
# One hidden layer with 5 neurons; (5,) is a tuple, whereas (5) is just the int 5
mlp_clf = MLPClassifier(hidden_layer_sizes=(5,), max_iter=1000, random_state=32, shuffle=True, verbose=False)

mlp_clf.fit(X_train, y_train)
y_pred = mlp_clf.predict(X_test)
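The classifier above receives integer codes as if they were numeric quantities. A possible variation, sketched here on made-up data, is to one-hot encode the categories and bundle the encoding with the network in a `Pipeline`, so that the preprocessing is refit inside every cross-validation fold. The frame `toy`, the target `toy_y`, and the step names are hypothetical; this is not the configuration used in the notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data reusing two feature names from the dataset
toy = pd.DataFrame({
    "safety": ["low", "med", "high"] * 8,
    "persons": ["2", "4", "more", "more"] * 6,
})
# Made-up target: 1 when safety is not "low", otherwise 0
toy_y = (toy["safety"] != "low").astype(int)

pipe = Pipeline([
    # One binary column per category value instead of a single integer code
    ("onehot", ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["safety", "persons"]),
    ])),
    ("mlp", MLPClassifier(hidden_layer_sizes=(5,), max_iter=1000, random_state=32)),
])

pipe.fit(toy, toy_y)
toy_pred = pipe.predict(toy)
print(toy_pred[:6])
```

The fitted `pipe` can be passed to `cross_val_score` or `GridSearchCV` in place of a bare estimator.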

Evaluate the classification quality.

In [10]:
y_score = cross_val_score(mlp_clf, X_train, y_train, cv=10)

metrics = [
    ["Mean squared error (MSE)", mean_squared_error(y_test, y_pred)],
    ["Mean absolute error (MAE)", mean_absolute_error(y_test, y_pred)],
    ["Accuracy", mlp_clf.score(X_test, y_test)],
    ["Cross validation Accuracy", y_score.mean()]
]
pd.DataFrame(data=metrics, columns=["Metric", "Score"])

| | Metric | Score |
|---|---|---|
| 0 | Mean squared error (MSE) | 0.71979 |
| 1 | Mean absolute error (MAE) | 0.327496 |
| 2 | Accuracy | 0.833625 |
| 3 | Cross validation Accuracy | 0.827159 |
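MSE and MAE treat the encoded class labels as numbers, so their values depend on the arbitrary codes assigned by the encoder. Metrics designed for classification, such as macro-averaged F1 and the confusion matrix (already imported in the first cell), may be easier to interpret. The sketch below computes them for small hand-written label lists; `y_true_toy` and `y_pred_toy` are invented for illustration and are not results from the model.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Invented labels for three classes; not produced by the notebook's model
y_true_toy = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred_toy = [0, 0, 0, 2, 1, 0, 2, 2]

acc = accuracy_score(y_true_toy, y_pred_toy)
# Macro averaging weights every class equally, regardless of its size
macro_f1 = f1_score(y_true_toy, y_pred_toy, average="macro")
# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_toy, y_pred_toy)

print("accuracy:", acc)
print("macro F1:", round(macro_f1, 3))
print(cm)
```

In the notebook the same calls would take `y_test` and `y_pred` as arguments.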


Search for the best hyperparameters for the network.

In [11]:
params = {
    'activation': ['logistic', 'tanh', 'relu'],
    'solver': ['lbfgs', 'adam', 'sgd'],
    'alpha': 10.0 ** -np.arange(1, 3),  # 0.1 and 0.01
    # Each int is the size of a single hidden layer
    'hidden_layer_sizes': [3, 4, 5, 12, 18, 24]
}

mlp_clf_cv = MLPClassifier(random_state=32)
gscv = GridSearchCV(mlp_clf_cv, params, cv=10, n_jobs=10)
gscv_pred = gscv.fit(X_train, y_train).predict(X_test)
gscv.best_params_

{'activation': 'logistic',
 'alpha': 0.1,
 'hidden_layer_sizes': 18,
 'solver': 'lbfgs'}
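To get a feel for the cost of this search, the number of candidate configurations can be counted with `ParameterGrid`, which enumerates the same combinations that `GridSearchCV` tries. The sketch below repeats the grid from the cell above; the names `n_candidates` and `n_fits` are illustrative.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same grid as in the search above
params = {
    'activation': ['logistic', 'tanh', 'relu'],
    'solver': ['lbfgs', 'adam', 'sgd'],
    'alpha': 10.0 ** -np.arange(1, 3),
    'hidden_layer_sizes': [3, 4, 5, 12, 18, 24],
}

# 3 activations * 3 solvers * 2 alphas * 6 layer sizes
n_candidates = len(ParameterGrid(params))
# Each candidate is trained once per fold with cv=10
n_fits = n_candidates * 10

print(n_candidates, "candidates,", n_fits, "model fits")
```

After the best candidate is chosen, `GridSearchCV` refits it once more on the whole training set by default, so the total number of trainings is one higher.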

Evaluate the classification quality with the best parameters found.

In [12]:
# gscv.best_score_ already holds the mean cross-validated accuracy of the best parameter set

metrics = [
    ["Mean squared error (MSE)", mean_squared_error(y_test, gscv_pred)],
    ["Mean absolute error (MAE)", mean_absolute_error(y_test, gscv_pred)],
    ["Accuracy", gscv.score(X_test, y_test)],
    ["Cross validation Accuracy", gscv.best_score_]
]
pd.DataFrame(data=metrics, columns=["Metric", "Score"])

| | Metric | Score |
|---|---|---|
| 0 | Mean squared error (MSE) | 0.082312 |
| 1 | Mean absolute error (MAE) | 0.04028 |
| 2 | Accuracy | 0.978984 |
| 3 | Cross validation Accuracy | 0.987046 |


With the tuned parameters, classification accuracy is higher and the error is lower: test accuracy rises from 0.834 to 0.979, and MSE falls from 0.720 to 0.082.