### Implementing the machine learning model for K=3 with a 75% training set and a 25% test set

Loading the libraries and the data file

In [1]:
import random
import pandas as pd
import numpy as np
from scipy.spatial import KDTree
from sklearn.metrics import mean_absolute_error
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from collections import Counter

In [2]:
houses = pd.read_csv('wine_data/white_red.csv', sep=';')
values = houses['quality']
print(values)

0        5
1        5
2        5
3        6
4        5
        ..
32480    6
32481    5
32482    6
32483    7
32484    6
Name: quality, Length: 32485, dtype: int64


In [3]:
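# Remove the target column so that only the features remain.
houses.drop('quality', axis=1, inplace=True)
# Mean normalization: center each column and divide by its range.
houses = (houses - houses.mean()) / (houses.max() - houses.min())
# Keep only the three features used by the model.
houses = houses[['total sulfur dioxide', 'pH', 'alcohol']]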
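
The normalization in this cell subtracts each column's mean and divides by the column's range, so every feature ends up centered on zero with a spread of exactly one. A minimal sketch on made-up values (the frame `toy` and its numbers are illustrative assumptions, not rows from the wine file):

```python
import pandas as pd

# Toy data: column names borrowed from the notebook, values invented.
toy = pd.DataFrame({
    "pH": [3.0, 3.2, 3.4],
    "alcohol": [9.0, 10.0, 14.0],
})

# Mean normalization: subtract the column mean, divide by the column range.
normalized = (toy - toy.mean()) / (toy.max() - toy.min())

print(normalized)
```

Putting the features on a common scale matters for KNN because the neighbor search relies on Euclidean distance, which would otherwise be dominated by the column with the largest numeric range.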

In [4]:
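# Note: the tree is built from all rows, before any train/test split.
kdtree = KDTree(houses)

def classify(query_point, k):
    # Majority vote over the quality labels of the k nearest neighbors.
    _, idx = kdtree.query(query_point, k)
    return np.argmax(np.bincount(values.iloc[idx]))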
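
To make the voting step in `classify` concrete, here is a small sketch on hand-made 2-D points. The names `points`, `labels`, `tree` and `classify_toy` are illustrative assumptions; only `KDTree.query`, `np.bincount` and `np.argmax` come from the notebook.

```python
import numpy as np
from scipy.spatial import KDTree

# Five hand-made points: three near the origin, two near (1, 1).
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [1.0, 1.0], [1.1, 1.0]])
labels = np.array([5, 5, 6, 7, 7])

tree = KDTree(points)

def classify_toy(query_point, k):
    # query returns (distances, indices) of the k nearest stored points.
    _, idx = tree.query(query_point, k)
    # bincount counts how often each label occurs; argmax picks the most frequent.
    return int(np.argmax(np.bincount(labels[np.atleast_1d(idx)])))

near_origin = classify_toy([0.05, 0.05], 3)
near_one = classify_toy([1.0, 1.05], 3)
print(near_origin, near_one)
```

When two labels receive the same number of votes, `np.argmax` returns the first maximum, so the smaller label wins the tie.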


In [5]:
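# Draw 25% of the row labels at random for the test set; the rest form the training set.
test_rows = random.sample(houses.index.tolist(), int(round(len(houses) * .25)))  # 25%
# A sorted list, because recent pandas versions reject a set as a .loc indexer.
train_rows = sorted(set(range(len(houses))) - set(test_rows))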
df_test = houses.loc[test_rows]
df_train = houses.drop(test_rows)
test_values = values.loc[test_rows]
train_values = values.loc[train_rows]
train_classified_values = []
test_classified_values = []
train_actual_values = []
test_actual_values = []

In [6]:
for _id, row in df_train.iterrows():
    train_classified_values.append(classify(row, 3))
    train_actual_values.append(train_values[_id])

print('Mean absolute error for k=3 on the 75% training set: ', mean_absolute_error(train_actual_values, train_classified_values))

for _id, row in df_test.iterrows():
    test_classified_values.append(classify(row, 3))
    test_actual_values.append(test_values[_id])
print('Mean absolute error for k=3 on the 25% test set: ', mean_absolute_error(test_actual_values, test_classified_values))


Mean absolute error for k=3 on the 75% training set:  0.004720078804793958
Mean absolute error for k=3 on the 25% test set:  0.005541189508681197
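
One caveat about these numbers: `kdtree` is built from every row of `houses` before the split, so each queried row, test rows included, is already stored in the tree and is its own nearest neighbor at distance zero. That makes the test error optimistic. A possible variant, sketched on synthetic data, builds the tree from the training rows only. The arrays `features` and `labels` below are invented stand-ins for the wine data, not the notebook's data.

```python
import numpy as np
from scipy.spatial import KDTree

rng = np.random.default_rng(0)

# Synthetic stand-in for the features: two well-separated clusters in 3-D.
features = np.vstack([
    rng.normal(0.0, 0.1, (100, 3)),
    rng.normal(1.0, 0.1, (100, 3)),
])
labels = np.array([5] * 100 + [6] * 100)

# Shuffle the row positions and hold out 25% of them as the test set.
order = rng.permutation(len(features))
n_test = len(features) // 4
test_idx, train_idx = order[:n_test], order[n_test:]

# The tree only ever sees training rows, so test rows cannot match themselves.
train_tree = KDTree(features[train_idx])
train_labels = labels[train_idx]

# One vectorized query for all test rows; neighbors has shape (n_test, 3).
_, neighbors = train_tree.query(features[test_idx], k=3)
predicted = np.array([np.argmax(np.bincount(train_labels[row])) for row in neighbors])

test_mae = np.mean(np.abs(predicted - labels[test_idx]))
print(test_mae)
```

Querying all test rows in one call also avoids the row-by-row `iterrows` loop, which tends to be slow on tens of thousands of rows.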


#### Next, I check the model's accuracy

In [7]:
houses = pd.read_csv('wine_data/white_red.csv', sep=';')

In [8]:
X = houses.drop('quality', axis=1)
y = houses['quality']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.25)

In [9]:
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
acc = (accuracy_score(y_test, y_pred)) * 100

print(f'Accuracy of the trained model (75%:25% training/test split): {acc}%')

Accuracy of the trained model (75%:25% training/test split): 97.32824427480917%
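
Two details of this scikit-learn run differ from the KDTree version: the classifier is fitted on all raw, unscaled columns, and `train_test_split` is called without `random_state`, so the accuracy changes from run to run. A hedged sketch of one way to address both, on synthetic data (`X_toy`, `y_toy` and the pipeline name `scaled_knn` are illustrative assumptions; `StandardScaler` is a swapped-in scaler, not the mean normalization used earlier):

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data: the first feature separates the classes, the second is
# large-scale noise that dominates unscaled Euclidean distances.
n = 200
y_toy = np.repeat([0, 1], n // 2)
X_toy = np.column_stack([y_toy + rng.normal(0, 0.1, n), rng.normal(0, 1000, n)])

# A fixed random_state makes the split, and hence the score, reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0, stratify=y_toy
)

# The scaler is fitted on the training part only and reused on the test part.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
scaled_knn.fit(X_tr, y_tr)
acc_toy = accuracy_score(y_te, scaled_knn.predict(X_te))
print(f"accuracy on the toy test set: {acc_toy:.3f}")
```

Wrapping the scaler and the classifier in one pipeline keeps the test rows out of the scaling statistics, which is easy to get wrong when scaling by hand before the split.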


###### Judging by the accuracy, one could say that the model is excellent.
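
Before accepting such a score, it may be worth checking whether the file contains repeated rows, because a row that appears in both the training and the test part lets KNN simply look up the answer. The row count printed earlier, 32485, is exactly five times 6497, the size of the combined UCI red and white Wine Quality tables, so repeats are at least plausible here. This is an assumption, not something the notebook confirms. A minimal sketch of the check on invented rows:

```python
import pandas as pd

# Made-up rows; the third row repeats the first one exactly.
toy = pd.DataFrame({
    "pH": [3.1, 3.3, 3.1, 3.5],
    "alcohol": [9.4, 10.1, 9.4, 12.0],
    "quality": [5, 6, 5, 7],
})

# duplicated() marks every repeat of an earlier row as True.
n_repeats = int(toy.duplicated().sum())
deduplicated = toy.drop_duplicates()

print(f"{n_repeats} repeated row(s), {len(deduplicated)} unique row(s)")
```

Running the same two calls on the real frame before splitting would show whether the reported accuracy reflects generalization or memorized copies.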