### Advanced Machine Learning Methods

#### Laboratory 5

##### Logistic regression model with interactions

In [20]:
import seaborn as sns
titanic = sns.load_dataset('titanic')
titanic = titanic[["survived", "pclass", "sex", "age", "sibsp", "parch", "fare", "embark_town"]]
titanic.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embark_town
0,0,3,male,22.0,1,0,7.25,Southampton
1,1,1,female,38.0,1,0,71.2833,Cherbourg
2,1,3,female,26.0,0,0,7.925,Southampton
3,1,1,female,35.0,1,0,53.1,Southampton
4,0,3,male,35.0,0,0,8.05,Southampton


- survived - whether the passenger survived (1, 0)
- pclass - the class the passenger travelled in (1, 2, 3)
- sex - sex of the passenger
- age - age of the passenger
- sibsp - number of siblings/spouses aboard
- parch - number of parents/children aboard
- fare - ticket fare
- embark_town - port where the passenger boarded

### Task 1
----
Build a logistic regression model for the variable `survived`.

*Note: remember to explore the data first and prepare it for further work; you may use a `pipeline` and hyperparameter optimization.*

In [21]:
y = titanic.survived
X = titanic.drop(["survived"], axis=1)

In [22]:
numerical_features = ['age', 'sibsp', 'parch', 'fare']
categorical_features = ['sex', 'embark_town', 'pclass']

In [23]:
print('Numerical features:', numerical_features)
print('Categorical features:', categorical_features)

Numerical features: ['age', 'sibsp', 'parch', 'fare']
Categorical features: ['sex', 'embark_town', 'pclass']


In [24]:
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression

## transformation of numerical variables: mean imputation, then standardization
numerical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy = 'mean')),
    ('scaler', StandardScaler())
])

## transformation of categorical variables: mode imputation, then one-hot encoding
categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy = 'most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown = 'ignore'))
])

## apply each transformer to its own group of columns
preprocessor = ColumnTransformer([
    ('numerical', numerical_transformer, numerical_features),
    ('categorical', categorical_transformer, categorical_features)
])

## pipeline with the logistic regression model
pipeline = Pipeline([
    ('pre', preprocessor),
    ('glm', LogisticRegression(penalty = None)) # or "none" in older scikit-learn versions; the LogisticRegression() default is L2
])

In [25]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)

In [26]:
glm = pipeline.fit(X_train, y_train)

In [27]:
train_score = pipeline.score(X_train, y_train)
print('Accuracy score on the training set:', np.round(train_score, 5))

test_score = pipeline.score(X_test, y_test)
print('Accuracy score on the test set:', np.round(test_score, 5))

Accuracy score on the training set: 0.82022
Accuracy score on the test set: 0.79478


In [28]:
# coefficients of the logistic regression model

for a, b in zip(preprocessor.get_feature_names_out(), np.round(pipeline['glm'].coef_[0], 4)):
    print(a, ": ", b)

numerical__age :  -0.4736
numerical__sibsp :  -0.3905
numerical__parch :  -0.1223
numerical__fare :  0.1199
categorical__sex_female :  1.45
categorical__sex_male :  -1.2807
categorical__embark_town_Cherbourg :  0.3818
categorical__embark_town_Queenstown :  0.2453
categorical__embark_town_Southampton :  -0.4578
categorical__pclass_1 :  1.2361
categorical__pclass_2 :  0.0941
categorical__pclass_3 :  -1.161


### Task 2
---
How are interactions added to a logistic regression model?


For the data from *Task 1*, propose possible interactions and build a corresponding model.
Do the proposed interactions actually occur?



Interactions are added by creating new variables (products of existing columns) in the data frame. This can be done manually or with the [`PolynomialFeatures`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html) transformer.
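In terms of the model, an interaction is an extra product term in the linear predictor. For two features $x_1$ and $x_2$ (generic symbols, not specific columns of the dataset), the model can be written as:

$$
\operatorname{logit} P(Y = 1 \mid x) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{12}\, x_1 x_2
$$

The effect of $x_1$ on the log-odds is then $\beta_1 + \beta_{12} x_2$, so it depends on the value of $x_2$. With $\beta_{12} = 0$ the model reduces to the additive one.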

In [29]:
## transform the data frame with the preprocessor from the pipeline (Task 1)
df = preprocessor.fit_transform(titanic)

In [30]:
df = pd.DataFrame(df, columns=preprocessor.get_feature_names_out())

In [31]:
df

Unnamed: 0,numerical__age,numerical__sibsp,numerical__parch,numerical__fare,categorical__sex_female,categorical__sex_male,categorical__embark_town_Cherbourg,categorical__embark_town_Queenstown,categorical__embark_town_Southampton,categorical__pclass_1,categorical__pclass_2,categorical__pclass_3
0,-0.592481,0.432793,-0.473674,-0.502445,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0
1,0.638789,0.432793,-0.473674,0.786845,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
2,-0.284663,-0.474545,-0.473674,-0.488854,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
3,0.407926,0.432793,-0.473674,0.420730,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0
4,0.407926,-0.474545,-0.473674,-0.486337,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...
886,-0.207709,-0.474545,-0.473674,-0.386671,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0
887,-0.823344,-0.474545,-0.473674,-0.044381,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0
888,0.000000,0.432793,2.008933,-0.176263,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
889,-0.284663,-0.474545,-0.473674,-0.044381,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0


#### New variables created manually

In [32]:
## pclass * sex
## age * sex

df["categorical__pclass_1_categorical__sex_male"] = df["categorical__pclass_1"] * df["categorical__sex_male"]
df["categorical__pclass_2_categorical__sex_male"] = df["categorical__pclass_2"] * df["categorical__sex_male"]
df["categorical__pclass_3_categorical__sex_male"] = df["categorical__pclass_3"] * df["categorical__sex_male"]
df["categorical__pclass_1_categorical__sex_female"] = df["categorical__pclass_1"] * df["categorical__sex_female"]
df["categorical__pclass_2_categorical__sex_female"] = df["categorical__pclass_2"] * df["categorical__sex_female"]
df["categorical__pclass_3_categorical__sex_female"] = df["categorical__pclass_3"] * df["categorical__sex_female"]

In [34]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size = 0.3)

In [35]:
glm2 = LogisticRegression(penalty=None)
glm2.fit(X_train, y_train)

In [36]:
train_score = glm2.score(X_train, y_train)
print('Accuracy score on the training set:', np.round(train_score, 5))

test_score = glm2.score(X_test, y_test)
print('Accuracy score on the test set:', np.round(test_score, 5))

Accuracy score on the training set: 0.81701
Accuracy score on the test set: 0.79851


In [38]:
# coefficients of the logistic regression model with interactions

for a, b in zip(df.columns, np.round(glm2.coef_[0], 4)):
    print(a, ": ", b)

numerical__age :  -0.4727
numerical__sibsp :  -0.3103
numerical__parch :  -0.0425
numerical__fare :  0.1543
categorical__sex_female :  1.2651
categorical__sex_male :  -1.0706
categorical__embark_town_Cherbourg :  0.3225
categorical__embark_town_Queenstown :  0.107
categorical__embark_town_Southampton :  -0.235
categorical__pclass_1 :  0.8405
categorical__pclass_2 :  0.1243
categorical__pclass_3 :  -0.7703
categorical__pclass_1_categorical__sex_male :  0.0453
categorical__pclass_2_categorical__sex_male :  -0.9515
categorical__pclass_3_categorical__sex_male :  -0.1643
categorical__pclass_1_categorical__sex_female :  0.7953
categorical__pclass_2_categorical__sex_female :  1.0759
categorical__pclass_3_categorical__sex_female :  -0.606


#### New variables using `PolynomialFeatures`

In [39]:
df = preprocessor.fit_transform(titanic)
df = pd.DataFrame(df, columns=preprocessor.get_feature_names_out())

In [56]:
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(interaction_only=True)
df_new = pd.DataFrame(poly.fit_transform(df), columns=poly.get_feature_names_out(df.columns))


In [57]:
df_new

Unnamed: 0,1,numerical__age,numerical__sibsp,numerical__parch,numerical__fare,categorical__sex_female,categorical__sex_male,categorical__embark_town_Cherbourg,categorical__embark_town_Queenstown,categorical__embark_town_Southampton,...,categorical__embark_town_Queenstown categorical__embark_town_Southampton,categorical__embark_town_Queenstown categorical__pclass_1,categorical__embark_town_Queenstown categorical__pclass_2,categorical__embark_town_Queenstown categorical__pclass_3,categorical__embark_town_Southampton categorical__pclass_1,categorical__embark_town_Southampton categorical__pclass_2,categorical__embark_town_Southampton categorical__pclass_3,categorical__pclass_1 categorical__pclass_2,categorical__pclass_1 categorical__pclass_3,categorical__pclass_2 categorical__pclass_3
0,1.0,-0.592481,0.432793,-0.473674,-0.502445,0.0,1.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1,1.0,0.638789,0.432793,-0.473674,0.786845,1.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1.0,-0.284663,-0.474545,-0.473674,-0.488854,1.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
3,1.0,0.407926,0.432793,-0.473674,0.420730,1.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
4,1.0,0.407926,-0.474545,-0.473674,-0.486337,0.0,1.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,1.0,-0.207709,-0.474545,-0.473674,-0.386671,0.0,1.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
887,1.0,-0.823344,-0.474545,-0.473674,-0.044381,1.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
888,1.0,0.000000,0.432793,2.008933,-0.176263,1.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
889,1.0,-0.284663,-0.474545,-0.473674,-0.044381,0.0,1.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [65]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df_new, y, test_size = 0.3)

In [70]:
glm3 = LogisticRegression()  # default L2 penalty, unlike glm and glm2
glm3.fit(X_train, y_train)

In [71]:
train_score = glm3.score(X_train, y_train)
print('Accuracy score on the training set:', np.round(train_score, 5))

test_score = glm3.score(X_test, y_test)
print('Accuracy score on the test set:', np.round(test_score, 5))

Accuracy score on the training set: 0.83467
Accuracy score on the test set: 0.81716


In [72]:
# coefficients of the logistic regression model with interactions

for a, b in zip(df_new.columns, np.round(glm3.coef_[0], 4)):
    print(a, ": ", b)

1 :  0.001
numerical__age :  -0.1517
numerical__sibsp :  -0.0162
numerical__parch :  0.1654
numerical__fare :  0.1471
categorical__sex_female :  1.0195
categorical__sex_male :  -1.0186
categorical__embark_town_Cherbourg :  0.283
categorical__embark_town_Queenstown :  -0.1911
categorical__embark_town_Southampton :  -0.0909
categorical__pclass_1 :  0.335
categorical__pclass_2 :  0.0569
categorical__pclass_3 :  -0.3909
numerical__age numerical__sibsp :  0.2023
numerical__age numerical__parch :  -0.6035
numerical__age numerical__fare :  0.4297
numerical__age categorical__sex_female :  0.0955
numerical__age categorical__sex_male :  -0.2471
numerical__age categorical__embark_town_Cherbourg :  -0.2262
numerical__age categorical__embark_town_Queenstown :  -0.0464
numerical__age categorical__embark_town_Southampton :  0.1209
numerical__age categorical__pclass_1 :  -0.0284
numerical__age categorical__pclass_2 :  -0.184
numerical__age categorical__pclass_3 :  0.0607
numerical__sibsp numerical__pa
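One caveat about the cells in this task: the preprocessor is fitted on the whole dataset before the train/test split, so the imputation and scaling statistics also see the test rows. A possible alternative is to place `PolynomialFeatures` inside the pipeline, so that every step is fitted on training data only. The sketch below does this on a synthetic frame (random values, column names borrowed from the notebook). The helper name `build_pipeline` and the choices `include_bias=False` and `sparse_threshold=0` are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures, StandardScaler

# Synthetic stand-in for the Titanic frame (illustration only)
rng = np.random.default_rng(1)
n = 400
toy = pd.DataFrame({
    "age": rng.normal(30, 12, n),
    "sex": rng.choice(["male", "female"], n),
    "pclass": rng.choice([1, 2, 3], n),
})
# Outcome with a built-in sex x pclass interaction (invented for the example)
female = (toy["sex"] == "female").to_numpy()
first = (toy["pclass"] == 1).to_numpy()
logit = -1.0 + 1.0 * female + 0.5 * first + 2.0 * (female & first)
toy_y = rng.binomial(1, 1 / (1 + np.exp(-logit)))


def build_pipeline(with_interactions):
    """Preprocessing, optional pairwise interactions, logistic regression."""
    pre = ColumnTransformer([
        ("numerical", StandardScaler(), ["age"]),
        ("categorical", OneHotEncoder(handle_unknown="ignore"), ["sex", "pclass"]),
    ], sparse_threshold=0)  # always return a dense array
    steps = [("pre", pre)]
    if with_interactions:
        # include_bias=False: LogisticRegression fits its own intercept
        steps.append(("poly", PolynomialFeatures(interaction_only=True,
                                                 include_bias=False)))
    steps.append(("glm", LogisticRegression(max_iter=1000)))
    return Pipeline(steps)


# Same folds for both models, every step refitted inside each fold
scores_plain = cross_val_score(build_pipeline(False), toy, toy_y, cv=5)
scores_inter = cross_val_score(build_pipeline(True), toy, toy_y, cv=5)
print("without interactions:", scores_plain.mean())
print("with interactions:   ", scores_inter.mean())
```

Cross-validation on identical folds also gives a fairer comparison between the two models than accuracies taken from two different random splits.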

### Model ensembles

Which methods of building model ensembles do we know?

- bagging
- boosting (gradient boosting)
- voting
- stacking
- random forest
- extra trees
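Each method in the list has a ready-made estimator in `sklearn.ensemble`. The sketch below fits one of each on synthetic data from `make_classification` and compares cross-validated accuracy. The dataset, the base models, and all hyperparameter values are arbitrary choices made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    BaggingClassifier,
    ExtraTreesClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
    VotingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem (illustration only)
X_toy, y_toy = make_classification(n_samples=300, n_features=10,
                                   n_informative=5, random_state=0)

# Base models shared by the voting and stacking ensembles
base = [
    ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
    ("logreg", LogisticRegression(max_iter=1000)),
]

ensembles = {
    "bagging": BaggingClassifier(DecisionTreeClassifier(), n_estimators=25,
                                 random_state=0),
    "boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
    "voting": VotingClassifier(base, voting="soft"),
    "stacking": StackingClassifier(base, final_estimator=LogisticRegression()),
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "extra trees": ExtraTreesClassifier(n_estimators=50, random_state=0),
}

# Mean 3-fold cross-validated accuracy for each ensemble
results = {name: cross_val_score(model, X_toy, y_toy, cv=3).mean()
           for name, model in ensembles.items()}
for name, score in results.items():
    print(f"{name}: {score:.3f}")
```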

### Task 3
---
Using what you know about the different ways of building model ensembles, propose several models for the `credit-g` data.

In [None]:
from sklearn.datasets import fetch_openml

df = fetch_openml(data_id = 31)
y = df.target
X = df.data

### Task 4*
---
Implement the model stacking method.
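One possible starting point, shown as a sketch rather than a reference solution: base models produce out-of-fold probabilities, and a meta-model is trained on those probabilities. The class name `SimpleStacking`, the restriction to binary problems, and the synthetic data are assumptions made for the example.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


class SimpleStacking:
    """Binary stacking: out-of-fold base probabilities feed a meta-model."""

    def __init__(self, base_models, meta_model, cv=5):
        self.base_models = base_models
        self.meta_model = meta_model
        self.cv = cv

    def fit(self, X, y):
        # Level-1 features: out-of-fold P(y = 1) from each base model, so the
        # meta-model never sees predictions made on a model's own training rows
        oof = [
            cross_val_predict(clone(m), X, y, cv=self.cv,
                              method="predict_proba")[:, 1]
            for m in self.base_models
        ]
        self.meta_ = clone(self.meta_model).fit(np.column_stack(oof), y)
        # Refit every base model on all training data for prediction time
        self.fitted_ = [clone(m).fit(X, y) for m in self.base_models]
        return self

    def predict(self, X):
        level1 = np.column_stack(
            [m.predict_proba(X)[:, 1] for m in self.fitted_]
        )
        return self.meta_.predict(level1)


# Usage on synthetic data (illustration only)
X_s, y_s = make_classification(n_samples=500, n_features=10,
                               n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_s, y_s, test_size=0.3,
                                          random_state=0)
stack = SimpleStacking(
    base_models=[DecisionTreeClassifier(max_depth=4, random_state=0),
                 KNeighborsClassifier()],
    meta_model=LogisticRegression(),
).fit(X_tr, y_tr)
stack_accuracy = np.mean(stack.predict(X_te) == y_te)
print("Stacking test accuracy:", stack_accuracy)
```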

### Task 5*
---
Implement the bagging method.
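A sketch of one way to approach it, under simplifying assumptions: inputs are NumPy arrays, labels are 0/1, and predictions are combined by majority vote. The class name `SimpleBagging` and the synthetic data are invented for the example.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


class SimpleBagging:
    """Bagging for 0/1 labels: bootstrap samples plus majority vote."""

    def __init__(self, base_model, n_estimators=25, random_state=0):
        self.base_model = base_model
        self.n_estimators = n_estimators
        self.random_state = random_state

    def fit(self, X, y):
        rng = np.random.default_rng(self.random_state)
        n = len(X)
        self.models_ = []
        for _ in range(self.n_estimators):
            # Bootstrap sample: n row indices drawn with replacement
            idx = rng.integers(0, n, size=n)
            self.models_.append(clone(self.base_model).fit(X[idx], y[idx]))
        return self

    def predict(self, X):
        # votes has shape (n_estimators, n_samples)
        votes = np.stack([m.predict(X) for m in self.models_])
        # Predict 1 when at least half of the models vote for class 1
        return (votes.mean(axis=0) >= 0.5).astype(int)


# Usage on synthetic data (illustration only)
X_b, y_b = make_classification(n_samples=500, n_features=10,
                               n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_b, y_b, test_size=0.3,
                                          random_state=0)
bag = SimpleBagging(DecisionTreeClassifier(random_state=0)).fit(X_tr, y_tr)
bag_accuracy = np.mean(bag.predict(X_te) == y_te)
print("Bagging test accuracy:", bag_accuracy)
```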