# Lab - Classification

## Tasks

1. Load the `diabetes.csv` dataset. Prepare the data for modelling (split it into training, validation, and test sets, then scale the features). 
Build models that detect diabetes cases (the `Outcome` column): 
    - [`LogisticRegression`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html),
    - [`GaussianNB`](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html),
    - [`LinearDiscriminantAnalysis`](https://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html),
    - [`QuadraticDiscriminantAnalysis`](https://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.QuadraticDiscriminantAnalysis.html).

    Use the validation set to compare the models on accuracy, precision, recall, and F-score. Use the test set to evaluate the best model. 

2. Classify the mushrooms in the `agaricus-lepiota.data` dataset as poisonous (`p`) or edible (`e`) using [`CategoricalNB`](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.CategoricalNB.html).
    - Missing values are stored in the dataset as `?` (pass `na_values='?'` when loading the data). Drop the rows that contain missing values (`dropna(axis='rows')`).
    - Encode the input data (`X`) as 0, 1, 2, ... using [`OrdinalEncoder`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html).
    - Split the data into training and test sets. Print the confusion matrix, accuracy, and F-score for the test set. 

    

In [2]:
import pandas as pd

## Task 1

In [18]:
diabetes = pd.read_csv("diabetes.csv")
diabetes.head(5)

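|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |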


In [19]:
from sklearn.model_selection import train_test_split

In [20]:
X = diabetes.drop(["Outcome"], axis=1)
y = diabetes[["Outcome"]]

In [21]:
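# Hold out 20% of all rows as the test set.
X_full_train, X_test, y_full_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
# NOTE: test_size=0.875 sends 87.5% of the remaining rows to validation and leaves only 76 for training.
# A value such as 0.125 was probably intended; the outputs below reflect the split as written.
X_train, X_valid, y_train, y_valid = train_test_split(X_full_train, y_full_train, test_size=0.875, random_state=42)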

print(f"{len(X_train)}, {len(y_train)}")
print(f"{len(X_valid)}, {len(y_valid)}")
print(f"{len(X_test)}, {len(y_test)}")

76, 76
538, 538
154, 154
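The sizes above show that the second `test_size` is a fraction of the rows left after the first split, not of the whole dataset. The following is a minimal sketch of a roughly 70/10/20 split under that reading. It uses synthetic data of the same shape as a stand-in for `diabetes.csv`, and the 10% validation share and the use of `stratify` are assumptions, not part of the original solution.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the diabetes data: 768 rows, 8 features, binary target.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(768, 8))
y_demo = rng.integers(0, 2, size=768)

test_fraction = 0.20   # share of ALL rows held out for testing
valid_fraction = 0.10  # share of ALL rows held out for validation (assumed target)

# First split: carve off the test set, keeping class proportions via stratify.
X_rest, X_te, y_rest, y_te = train_test_split(
    X_demo, y_demo, test_size=test_fraction, stratify=y_demo, random_state=42
)

# Second split: the validation share must be expressed relative to what is left.
relative_valid = valid_fraction / (1.0 - test_fraction)  # 0.10 / 0.80 = 0.125
X_tr, X_va, y_tr, y_va = train_test_split(
    X_rest, y_rest, test_size=relative_valid, stratify=y_rest, random_state=42
)

print(len(X_tr), len(X_va), len(X_te))
```

With this arrangement most rows go to training, and the validation set is only as large as needed for model comparison.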


In [30]:
from sklearn.metrics import confusion_matrix

In [36]:
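def calculate_confusion_matrix(y_true, y_pred):
    """Print accuracy, precision, recall, and F-score for a binary problem.

    The class that sorts last among the labels is treated as positive.
    """
    # For two classes, confusion_matrix returns [[tn, fp], [fn, tp]].
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    accuracy = (tp+tn)/(tn+fp+fn+tp)
    precision = tp/(tp+fp)
    recall = tp/(tp+fn)
    f_score = (2 * precision * recall) / (precision + recall)
    
    print(f"Accuracy: {accuracy}")
    print(f"Precision: {precision}")
    print(f"Recall: {recall}")
    print(f"F-score: {f_score}")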
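As a cross-check, the same four quantities can be obtained from the ready-made functions in `sklearn.metrics`. The sketch below compares the hand-written formulas with the library results on a small made-up label vector; the data and the `_demo` names are illustrative only.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Made-up labels: 3 true positives, 1 false negative, 1 false positive, 3 true negatives.
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 0]

# Manual computation from the confusion matrix cells.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()
manual = {
    "accuracy": (tp + tn) / (tp + tn + fp + fn),
    "precision": tp / (tp + fp),
    "recall": tp / (tp + fn),
}
manual["f1"] = (
    2 * manual["precision"] * manual["recall"]
    / (manual["precision"] + manual["recall"])
)

# The same metrics from the library functions.
library = {
    "accuracy": accuracy_score(y_true_demo, y_pred_demo),
    "precision": precision_score(y_true_demo, y_pred_demo),
    "recall": recall_score(y_true_demo, y_pred_demo),
    "f1": f1_score(y_true_demo, y_pred_demo),
}

for name in manual:
    print(f"{name}: manual={manual[name]:.4f} library={library[name]:.4f}")
```

The library functions also accept a `zero_division` argument, which covers the case where a model predicts no positives and the manual precision formula would divide by zero.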

In [37]:
from sklearn.preprocessing import StandardScaler

In [38]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
X_valid = scaler.transform(X_valid)
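The scaler is fitted on the training set only and then reused for the other sets, so no information from validation or test data leaks into preprocessing. A small sketch on synthetic numbers (assumed data, not the diabetes features) illustrates the effect: the training set ends up with zero mean and unit variance exactly, the held-out set only approximately.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic features drawn from the same distribution for both sets.
rng = np.random.default_rng(0)
train_demo = rng.normal(loc=100.0, scale=15.0, size=(200, 3))
valid_demo = rng.normal(loc=100.0, scale=15.0, size=(50, 3))

# Learn mean and standard deviation from the training rows only.
scaler_demo = StandardScaler().fit(train_demo)
train_scaled = scaler_demo.transform(train_demo)
valid_scaled = scaler_demo.transform(valid_demo)  # reuses the training statistics

print("train mean/std:", train_scaled.mean(axis=0).round(3), train_scaled.std(axis=0).round(3))
print("valid mean/std:", valid_scaled.mean(axis=0).round(3), valid_scaled.std(axis=0).round(3))
```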

In [39]:
from sklearn.linear_model import LogisticRegression

In [42]:
lr = LogisticRegression().fit(X_train, y_train.values.ravel())
y_pred = lr.predict(X_valid)
calculate_confusion_matrix(y_valid, y_pred)

Accuracy: 0.6914498141263941
Precision: 0.5797101449275363
Recall: 0.425531914893617
F-score: 0.4907975460122699


In [41]:
from sklearn.naive_bayes import GaussianNB

In [45]:
gnb = GaussianNB().fit(X_train, y_train.values.ravel())
y_pred = gnb.predict(X_valid)
calculate_confusion_matrix(y_valid, y_pred)

Accuracy: 0.6914498141263941
Precision: 0.5625
Recall: 0.526595744680851
F-score: 0.5439560439560438


In [46]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

In [48]:
lda = LinearDiscriminantAnalysis().fit(X_train, y_train.values.ravel())
y_pred = lda.predict(X_valid)
calculate_confusion_matrix(y_valid, y_pred)

Accuracy: 0.6895910780669146
Precision: 0.5755395683453237
Recall: 0.425531914893617
F-score: 0.4892966360856269


In [49]:
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

In [50]:
qda = QuadraticDiscriminantAnalysis().fit(X_train, y_train.values.ravel())
y_pred = qda.predict(X_valid)
calculate_confusion_matrix(y_valid, y_pred)

Accuracy: 0.6561338289962825
Precision: 0.5081967213114754
Recall: 0.4946808510638298
F-score: 0.5013477088948787
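The task also asks for the best model to be evaluated on the test set. The sketch below shows one way to organise that step: fit all four models, tabulate their validation metrics, pick one by F-score, and only then score it on the test set. It runs on synthetic data from `make_classification` as a stand-in for the diabetes data, and choosing by F-score is an assumption; the numbers it prints are not results for `diabetes.csv`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 600 rows, 8 features, binary target.
X_syn, y_syn = make_classification(
    n_samples=600, n_features=8, n_informative=5, n_redundant=0, random_state=0
)
X_rest, X_te, y_rest, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_rest, y_rest, test_size=0.125, random_state=0)

# Fit the scaler on the training part only, then apply it everywhere.
scaler_syn = StandardScaler().fit(X_tr)
X_tr, X_va, X_te = (scaler_syn.transform(part) for part in (X_tr, X_va, X_te))

candidates = {
    "LogisticRegression": LogisticRegression(),
    "GaussianNB": GaussianNB(),
    "LDA": LinearDiscriminantAnalysis(),
    "QDA": QuadraticDiscriminantAnalysis(),
}

# Compare all candidates on the validation set.
rows = []
for name, model in candidates.items():
    pred = model.fit(X_tr, y_tr).predict(X_va)
    rows.append({
        "model": name,
        "accuracy": accuracy_score(y_va, pred),
        "precision": precision_score(y_va, pred, zero_division=0),
        "recall": recall_score(y_va, pred),
        "f1": f1_score(y_va, pred),
    })
results = pd.DataFrame(rows).set_index("model")
print(results.round(3))

# Select by validation F-score, then touch the test set exactly once.
best_name = results["f1"].idxmax()
test_pred = candidates[best_name].predict(X_te)
test_f1 = f1_score(y_te, test_pred)
print(f"Best on validation: {best_name}; test F-score: {test_f1:.3f}")
```

Keeping the test set out of the comparison loop means the final score is not biased by the model selection itself.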


## Task 2

In [57]:
agaricus_lepiota = pd.read_csv("agaricus-lepiota.data", na_values="?")
agaricus_lepiota.head(5)

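|   | class | cap-shape | cap-surface | cap-color | bruises? | odor | gill-attachment | gill-spacing | gill-size | gill-color | ... | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | p | x | s | n | t | p | f | c | n | k | ... | s | w | w | p | w | o | p | k | s | u |
| 1 | e | x | s | y | t | a | f | c | b | k | ... | s | w | w | p | w | o | p | n | n | g |
| 2 | e | b | s | w | t | l | f | c | b | n | ... | s | w | w | p | w | o | p | n | n | m |
| 3 | p | x | y | w | t | p | f | c | n | n | ... | s | w | w | p | w | o | p | k | s | u |
| 4 | e | x | s | g | f | n | f | w | b | k | ... | s | w | w | p | w | o | e | n | a | g |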


In [58]:
agaricus_lepiota.dropna(axis="rows", inplace=True)
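To make the effect of `na_values` and `dropna` concrete, here is a small sketch on an in-memory CSV. The three rows and the column subset are made up for illustration; only the `?` convention is taken from the task description.

```python
import io

import pandas as pd

# A tiny made-up file in which "?" marks a missing value.
raw = io.StringIO(
    "class,odor,stalk-root\n"
    "p,p,e\n"
    "e,a,?\n"
    "e,l,c\n"
)

# na_values="?" turns every "?" cell into NaN while reading.
toy_df = pd.read_csv(raw, na_values="?")
print(toy_df.isna().sum())

# dropna(axis="rows") removes each row that has at least one NaN.
toy_clean = toy_df.dropna(axis="rows")
print(len(toy_df), "->", len(toy_clean))
```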

In [59]:
X = agaricus_lepiota.drop(["class"], axis=1)
y = agaricus_lepiota[["class"]]

In [60]:
from sklearn.preprocessing import OrdinalEncoder

In [62]:
oe = OrdinalEncoder()
X = oe.fit_transform(X)

In [65]:
X

array([[5., 2., 4., ..., 1., 3., 5.],
       [5., 2., 7., ..., 2., 2., 1.],
       [0., 2., 6., ..., 2., 2., 3.],
       ...,
       [5., 3., 3., ..., 5., 5., 4.],
       [5., 3., 1., ..., 5., 1., 0.],
       [2., 3., 1., ..., 5., 1., 0.]])
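The encoded array is hard to read without the mapping behind it. The sketch below runs `OrdinalEncoder` on a tiny made-up frame to show that each column gets its own codes, assigned in sorted order of its categories, and that `categories_` and `inverse_transform` recover the original letters.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up sample with two categorical columns.
toy = pd.DataFrame({"cap-shape": ["x", "b", "x"], "odor": ["p", "a", "l"]})

enc = OrdinalEncoder()
codes = enc.fit_transform(toy)

# One array of categories per column; the position in it is the integer code.
print(enc.categories_)
print(codes)

# The mapping is reversible.
decoded = enc.inverse_transform(codes)
print(decoded)
```

The codes carry no real ordering. That is acceptable for `CategoricalNB`, which treats each code as a separate category rather than as a number on a scale.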

In [71]:
y

|      | class |
|------|-------|
| 0    | p     |
| 1    | e     |
| 2    | e     |
| 3    | p     |
| 4    | e     |
| ...  | ...   |
| 7986 | e     |
| 8001 | e     |
| 8038 | e     |
| 8095 | p     |


In [66]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

In [67]:
from sklearn.naive_bayes import CategoricalNB

In [74]:
cnb = CategoricalNB().fit(X_train, y_train.values.ravel())
y_pred = cnb.predict(X_test)
# The labels sort as ['e', 'p'], so 'p' (poisonous) is the positive class in the metrics below.
calculate_confusion_matrix(y_test, y_pred)

Accuracy: 0.9663418954827281
Precision: 0.9923469387755102
Recall: 0.9174528301886793
F-score: 0.9534313725490197
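The task also asks for the confusion matrix itself, which the helper above does not print. A minimal sketch of printing it with explicit labels follows. It uses six made-up predictions rather than the mushroom results, and treating `p` as the positive class is an assumption that matches the sorted label order.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Made-up true and predicted labels ('e' = edible, 'p' = poisonous).
y_true_toy = ["e", "e", "p", "p", "e", "p"]
y_pred_toy = ["e", "p", "p", "p", "e", "e"]

# Rows are true classes and columns are predicted classes, in the order given by `labels`.
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=["e", "p"])
print(cm)

acc = accuracy_score(y_true_toy, y_pred_toy)
# pos_label names the class whose precision and recall enter the F-score.
f1 = f1_score(y_true_toy, y_pred_toy, pos_label="p")
print(f"Accuracy: {acc:.3f}, F-score (p): {f1:.3f}")
```

For this problem the lower-left cell (poisonous mushrooms predicted as edible) is the costly kind of error, so it is worth reading the matrix and not only the summary scores.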


In [76]:
y.value_counts()

class
e        3488
p        2156
Name: count, dtype: int64