
# REGRESSION MODELS

In [None]:
import pandas as pd
import numpy as np
from xgboost import XGBRegressor
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, mean_squared_error, r2_score, roc_auc_score, precision_score, recall_score, f1_score

Loading the data

In [None]:
file_path = 'data_cleaned_reduced.csv'
data = pd.read_csv(file_path)

Splitting the data into input and output variables

In [None]:
X = data.drop('quality', axis=1)
y = data['quality']

Splitting into training and test sets

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## RandomForestRegressor model

In [None]:
model_rf_regressor = RandomForestRegressor(random_state=42)
model_rf_regressor.fit(X_train, y_train)
predictions_rf_regressor = model_rf_regressor.predict(X_test)

cv_scores_rf_regressor = cross_val_score(estimator=model_rf_regressor, X=X_train, y=y_train, cv=20, scoring='r2')
rmse_rf_regressor = np.sqrt(mean_squared_error(y_test, predictions_rf_regressor))
r2_rf_regressor = model_rf_regressor.score(X_test, y_test)
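
For a regressor, `score` returns the coefficient of determination, so it should agree with `r2_score` applied to the stored predictions. A minimal sketch on synthetic data; the `make_regression` data set, the `*_demo` names, and the reduced `n_estimators=20` are assumptions used only to keep the example self-contained and fast:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the notebook's data (assumption, not the real data set).
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

demo_model = RandomForestRegressor(n_estimators=20, random_state=42)
demo_model.fit(Xd_train, yd_train)
demo_pred = demo_model.predict(Xd_test)

# Both values are R^2 on the held-out set and should coincide.
r2_from_score = demo_model.score(Xd_test, yd_test)
r2_from_metric = r2_score(yd_test, demo_pred)

# RMSE is expressed in the units of the target, unlike R^2.
rmse_demo = np.sqrt(mean_squared_error(yd_test, demo_pred))

print(np.isclose(r2_from_score, r2_from_metric))
```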


In [None]:
print("Cross-validated R^2 scores: ", list(map(lambda x: round(x, 4), cv_scores_rf_regressor)))
print("Mean R^2 from CV: ", round(cv_scores_rf_regressor.mean(), 4))
print("RMSE on the test set:", round(rmse_rf_regressor, 4))
print('RandomForestRegressor R2:', round(r2_rf_regressor, 4))

Cross-validated R^2 scores:  [0.2145, 0.0831, 0.2848, 0.3234, 0.319, 0.2021, 0.3858, 0.2623, 0.3213, 0.2111, 0.2174, 0.224, 0.3088, 0.3512, 0.4228, 0.2457, 0.2903, 0.2319, 0.2734, 0.1434]
Mean R^2 from CV:  0.2658
RMSE on the test set: 0.2001
RandomForestRegressor R2: 0.3561
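
The per-fold scores printed above also allow the spread across folds to be quantified. The sketch below recomputes the mean and the standard deviation from the printed (rounded) values; the variable names are illustrative. This spread is a different quantity from the RMSE, which is computed from the test-set predictions rather than from the fold scores.

```python
import numpy as np

# R^2 per fold, copied from the RandomForestRegressor output above.
rf_fold_scores = np.array([
    0.2145, 0.0831, 0.2848, 0.3234, 0.319, 0.2021, 0.3858, 0.2623, 0.3213, 0.2111,
    0.2174, 0.224, 0.3088, 0.3512, 0.4228, 0.2457, 0.2903, 0.2319, 0.2734, 0.1434,
])

rf_cv_mean = rf_fold_scores.mean()
rf_cv_std = rf_fold_scores.std()  # population standard deviation across the 20 folds

print("Mean R^2 from CV:", round(rf_cv_mean, 4))
print("Std of R^2 from CV:", round(rf_cv_std, 4))
```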


## XGBoost model

In [None]:
XGBR = XGBRegressor(n_estimators=1000, max_depth=7, eta=0.1, subsample=0.8, colsample_bytree=0.8, random_state=42)
XGBR.fit(X_train, y_train)
y_pred = XGBR.predict(X_test)

cv_scores = cross_val_score(estimator=XGBR, X=X_train, y=y_train, cv=20, scoring='r2')
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = XGBR.score(X_test, y_test)

In [None]:
print("Cross-validated R^2 scores: ", list(map(lambda x: round(x, 4), cv_scores)))
print("Mean R^2 from CV: ", round(cv_scores.mean(), 4))
print("RMSE on the test set:", round(rmse, 4))
print('XGBRegressor R2:', round(r2, 4))

Cross-validated R^2 scores:  [0.0923, 0.0376, 0.248, 0.2623, 0.2635, 0.1329, 0.3345, 0.1239, 0.1441, 0.1266, 0.1003, 0.1716, 0.2819, 0.2489, 0.3655, 0.2657, 0.2499, 0.1193, 0.1064, 0.1147]
Mean R^2 from CV:  0.1895
RMSE on the test set: 0.2036
XGBRegressor R2: 0.3331


# CLASSIFICATION MODELS

Creating bins so that the model predicts categories instead of continuous values

In [None]:
bins = [0, 0.4, 0.7, 1]
labels = ['bad', 'average', 'good']
data['quality_binned'] = pd.cut(data['quality'], bins=bins, labels=labels, include_lowest=True)

In [None]:
print(data['quality_binned'].value_counts())

quality_binned
average    1505
bad         936
good        619
Name: count, dtype: int64
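
To make the bin edges concrete: with `include_lowest=True` and the default right-closed intervals, the bins are [0, 0.4], (0.4, 0.7] and (0.7, 1]. The sketch below applies the same `bins` and `labels` to a few hypothetical quality values (the `sample` series is an assumption, not data from the file). A value outside the bins becomes `NaN`, which is why a `dropna` on the binned column is a sensible safeguard.

```python
import pandas as pd

bins = [0, 0.4, 0.7, 1]
labels = ['bad', 'average', 'good']

# Hypothetical values chosen to hit the bin edges; 1.2 lies outside every bin.
sample = pd.Series([0.0, 0.4, 0.5, 0.7, 0.71, 1.0, 1.2])
binned = pd.cut(sample, bins=bins, labels=labels, include_lowest=True)

print(binned.tolist())
# ['bad', 'bad', 'average', 'average', 'good', 'good', nan]
```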


Splitting the data into features and labels


In [None]:
data_cleaned = data.dropna(subset=['quality_binned'])

X = data_cleaned.drop(columns=['quality', 'quality_binned'])
y = data_cleaned['quality_binned']

Splitting the data into training and test sets

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
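
The split above is purely random, so the class proportions in the test set can drift from those in the full data. One option, not used in this notebook, is to pass `stratify=y`. A sketch under stated assumptions: the labels are rebuilt from the class counts printed earlier, and `X_demo` is a placeholder feature matrix.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Class counts taken from the value_counts() output; features are placeholders.
class_counts = {'average': 1505, 'bad': 936, 'good': 619}
y_demo = np.repeat(list(class_counts), list(class_counts.values()))
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify=y_demo keeps each class at roughly 20% of its total in the test set.
_, _, _, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

test_counts = {name: int((y_test_demo == name).sum()) for name in class_counts}
print(test_counts)
```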

## RandomForestClassifier model

In [None]:
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

Evaluating the results

In [None]:
print("RandomForestClassifier - Accuracy: ", accuracy)
print("RandomForestClassifier - Classification report:\n", report)

cv_scores = cross_val_score(clf, X, y, scoring='roc_auc_ovr', cv=10)
print('Cross-validation score with roc_auc_ovr scoring:', cv_scores.mean())

RandomForestClassifier - Accuracy:  0.6127450980392157
RandomForestClassifier - Classification report:
               precision    recall  f1-score   support

     average       0.59      0.76      0.66       295
         bad       0.67      0.54      0.60       202
        good       0.62      0.37      0.46       115

    accuracy                           0.61       612
   macro avg       0.63      0.56      0.57       612
weighted avg       0.62      0.61      0.60       612

Cross-validation score with roc_auc_ovr scoring: 0.7230030043522062
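
The `macro avg` and `weighted avg` rows of the report can be reproduced from the per-class rows. The macro average treats every class equally, while the weighted average weights each class by its support, which for recall equals the overall accuracy. A short sketch using the rounded recall values from the report above, so the results match only to two decimals:

```python
# Per-class recall and support copied from the classification report above.
recall = {'average': 0.76, 'bad': 0.54, 'good': 0.37}
support = {'average': 295, 'bad': 202, 'good': 115}

total = sum(support.values())

# Unweighted mean over classes.
macro_recall = sum(recall.values()) / len(recall)

# Support-weighted mean; for recall this is the fraction of correct predictions.
weighted_recall = sum(recall[c] * support[c] for c in recall) / total

print(round(macro_recall, 2))     # 0.56
print(round(weighted_recall, 2))  # 0.61
```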


## DecisionTreeClassifier model

In [None]:
clf_dt = DecisionTreeClassifier(random_state=42)
clf_dt.fit(X_train, y_train)

y_pred_dt = clf_dt.predict(X_test)

accuracy_dt = accuracy_score(y_test, y_pred_dt)
report_dt = classification_report(y_test, y_pred_dt)

Evaluating the results

In [None]:
print("DecisionTreeClassifier - Accuracy: ", accuracy_dt)
print("DecisionTreeClassifier - Classification report:\n", report_dt)

cv_scores_dt = cross_val_score(clf_dt, X, y, scoring='roc_auc_ovr', cv=10)
print('DecisionTreeClassifier - Cross-validation score with roc_auc_ovr scoring:', cv_scores_dt.mean())

DecisionTreeClassifier - Accuracy:  0.5032679738562091
DecisionTreeClassifier - Classification report:
               precision    recall  f1-score   support

     average       0.53      0.57      0.55       295
         bad       0.55      0.48      0.51       202
        good       0.37      0.39      0.38       115

    accuracy                           0.50       612
   macro avg       0.48      0.48      0.48       612
weighted avg       0.51      0.50      0.50       612

DecisionTreeClassifier - Cross-validation score with roc_auc_ovr scoring: 0.5705362719090568
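
For context, the accuracies above can be compared with a majority-class baseline. The sketch below is an illustration, not part of the original analysis: it rebuilds the test labels from the `support` column of the report and scores a `DummyClassifier` that always predicts the most frequent class. It assumes `average` is also the most frequent class in the training set, which is consistent with the overall class counts.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Test-set class sizes copied from the "support" column of the report.
support = {'average': 295, 'bad': 202, 'good': 115}
y_demo = np.repeat(list(support), list(support.values()))
X_demo = np.zeros((len(y_demo), 1))  # features are ignored by DummyClassifier

# Always predicts the most frequent class ('average').
baseline = DummyClassifier(strategy='most_frequent').fit(X_demo, y_demo)
baseline_accuracy = baseline.score(X_demo, y_demo)

print(round(baseline_accuracy, 4))  # 0.482
```

Against this baseline of about 0.48, the decision tree's 0.50 is a small gain, while the random forest's 0.61 is a clearer one.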


# Summary

Comparing the two regression models:
* RandomForestRegressor achieves a higher R^2 on the test set (0.3561 vs. 0.3331) and a higher mean R^2 in cross-validation (0.2658 vs. 0.1895), which indicates that it generalizes better than XGBoost.

For this reason, RandomForestRegressor is the better choice for the regression problem.
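
The regression comparison can also be checked fold by fold using the scores printed earlier. This sketch assumes both `cross_val_score` calls evaluated the same 20 folds, which holds when an integer `cv` is used with a regressor on the same `X_train` (scikit-learn then uses an unshuffled `KFold`). The variable names are illustrative.

```python
import numpy as np

# Per-fold R^2 values copied from the two cross-validation outputs.
rf_scores = np.array([
    0.2145, 0.0831, 0.2848, 0.3234, 0.319, 0.2021, 0.3858, 0.2623, 0.3213, 0.2111,
    0.2174, 0.224, 0.3088, 0.3512, 0.4228, 0.2457, 0.2903, 0.2319, 0.2734, 0.1434,
])
xgb_scores = np.array([
    0.0923, 0.0376, 0.248, 0.2623, 0.2635, 0.1329, 0.3345, 0.1239, 0.1441, 0.1266,
    0.1003, 0.1716, 0.2819, 0.2489, 0.3655, 0.2657, 0.2499, 0.1193, 0.1064, 0.1147,
])

# Paired comparison: same fold, two models.
diff = rf_scores - xgb_scores
rf_wins = int((diff > 0).sum())

print(f"RandomForestRegressor wins {rf_wins} of {len(diff)} folds")
# RandomForestRegressor wins 19 of 20 folds
print("Mean per-fold R^2 difference:", round(diff.mean(), 4))
```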

Comparing the two classification models:

* RandomForestClassifier achieves higher accuracy (0.61 vs. 0.50) and a higher cross-validated AUC (0.72 vs. 0.57) than DecisionTreeClassifier.
* Both metrics point the same way, which suggests that RandomForestClassifier is the more suitable model for the classification problem.