# Supervised learning - Classification
The goal of this exercise is to complete a hands-on task whose description is similar to the one in the classification project.

We will use the modified Household Prices Dataset.

Data source: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data

Important attributes description:
* SalePrice: The property's sale price in dollars. This is the target variable that you're trying to predict.
* MSSubClass: The building class
* BldgType: Type of dwelling
* HouseStyle: Style of dwelling
* OverallQual: Overall material and finish quality
* OverallCond: Overall condition rating
* YearBuilt: Original construction date
* Heating: Type of heating
* CentralAir: Central air conditioning
* GrLivArea: Above grade (ground) living area square feet
* BedroomAbvGr: Number of bedrooms above basement level

### Complete the following tasks:
1. **Describe what operations you are performing for each of the features**
    - Mainly focus on categorical features (replace categorical values with numbers)
2. Answer the following questions:
    - **How many values are missing?**
    - **How many instances do you have in each of the classes?**
    - **Which metric score do you propose for the classification model performance evaluation?** Accuracy suits balanced classes; with 100 ones and 10 zeros, a model that always predicts 1 would still reach about 90% accuracy. For imbalanced classes, prefer the F1 score, precision, and recall.
        - Hint: This depends on your previous answer
3. Finish your preprocessing pipeline and split the data into the Input and Output part (i.e. X and y variables)
4. Start with the Decision tree model
    - Use 5-fold cross validation
    - **Will you use *standard* cross validation or *stratified* cross validation? Why?** Stratified: it keeps the same ratio of zeros and ones in the training and test sets, whereas standard cross validation splits the data without regard to the class labels.
    - Compute mean of the obtained score values
5. Select one other algorithm from https://scikit-learn.org/stable/supervised_learning.html (candidate: random forest)
    - Repeat the 5-fold CV
6. **Write down which model is better and why**
7. Do **5 experiments** with hyper-parameters
    - Set the parameters
    - Do the 5-fold CV
    - Note the settings and score in the Markdown cell
8. **Write down which model is the best and why**
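The question in task 4 about standard versus stratified cross validation can be illustrated with a small sketch. The toy labels below are hypothetical (85 zeros followed by 15 ones, sorted on purpose) and are not taken from the dataset; the point is only to compare how `KFold` and `StratifiedKFold` distribute the minority class across test folds.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical toy labels: 85 zeros followed by 15 ones (sorted on purpose).
y_toy = np.array([0] * 85 + [1] * 15)
X_toy = np.zeros((len(y_toy), 1))  # the features do not matter for the split itself


def positive_ratio_per_fold(splitter):
    """Return the share of ones in the test part of every fold."""
    return [float(y_toy[test_idx].mean()) for _, test_idx in splitter.split(X_toy, y_toy)]


kfold_ratios = positive_ratio_per_fold(KFold(n_splits=5))
stratified_ratios = positive_ratio_per_fold(StratifiedKFold(n_splits=5))

print('KFold:          ', kfold_ratios)
print('StratifiedKFold:', stratified_ratios)
```

With unshuffled `KFold`, all ones in this toy example end up in a single test fold, while `StratifiedKFold` keeps the class ratio the same in every fold.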

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import math

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, StratifiedKFold, KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score, confusion_matrix, auc
from sklearn.preprocessing import OrdinalEncoder
from sklearn.neural_network import MLPClassifier

## We will use categorized price as a target variable
- Our goal is to predict if the house will be sold for more than 250k USD or not

In [2]:
# Keep only the attributes described above.
columns = ['SalePrice','MSSubClass','BldgType','HouseStyle','OverallQual','OverallCond','YearBuilt','Heating','CentralAir','GrLivArea','BedroomAbvGr']
df = pd.read_csv('zsu_cv1_data.csv').loc[:, columns]
# Binary target: 1 if the house sold for more than 250k USD, otherwise 0.
df['Target'] = (df.SalePrice > 250000).astype(int)
df = df.drop(['SalePrice'], axis=1)

In [3]:
df.head()

Unnamed: 0,MSSubClass,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,Heating,CentralAir,GrLivArea,BedroomAbvGr,Target
0,60,1Fam,2Story,7,5,2003,GasA,Y,1710,3,0
1,20,1Fam,1Story,6,8,1976,GasA,Y,1262,3,0
2,60,1Fam,2Story,7,5,2001,GasA,Y,1786,3,0
3,70,1Fam,2Story,7,5,1915,GasA,Y,1717,3,0
4,60,1Fam,2Story,8,5,2000,GasA,Y,2198,4,0


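Notes on encoding categorical features:
- Two distinct values: replace them with 1 and 0 (e.g. via `apply`).
- Ordinal encoding (0 to n) if the order of the categories matters.
- If there is no order, use one-hot encoding; many unique values produce many columns.
- If a feature has a lot of unique values, drop it rather than encode it.
- One-hot encoding = `pd.get_dummies`.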
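The encoding notes above can be sketched on a tiny example. The frame below is hypothetical: the `Quality` column and all values are made up for illustration, and only `CentralAir` and `BldgType` borrow names from the dataset.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy frame; the values are made up for illustration.
toy = pd.DataFrame({
    'CentralAir': ['Y', 'N', 'Y'],
    'Quality': ['Low', 'High', 'Medium'],  # hypothetical ordered feature
    'BldgType': ['1Fam', 'Duplex', '1Fam'],
})

# Binary feature: map the two values to 0 and 1.
toy['CentralAir'] = toy['CentralAir'].map({'N': 0, 'Y': 1})

# Ordered feature: ordinal encoding with an explicit category order.
encoder = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])
toy['Quality'] = encoder.fit_transform(toy[['Quality']]).ravel().astype(int)

# Unordered feature: one-hot encoding via get_dummies.
toy = pd.get_dummies(toy, columns=['BldgType'], prefix='BldgType', dtype=int)
print(toy)
```

Passing the category order explicitly matters here: without it, `OrdinalEncoder` would sort the labels alphabetically, which rarely matches the intended ranking.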


In [4]:
df.sort_values(by="Target", ascending=False).head(10)

Unnamed: 0,MSSubClass,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,Heating,CentralAir,GrLivArea,BedroomAbvGr,Target
798,60,1Fam,2Story,9,5,2008,GasA,Y,3140,4,1
1289,60,1Fam,2Story,8,5,2006,GasA,Y,1970,3,1
539,20,1Fam,1Story,8,5,2001,GasA,Y,1601,3,1
540,20,1Fam,1Story,9,5,2006,GasA,Y,1838,2,1
1302,60,1Fam,2Story,8,5,1994,GasA,Y,2526,4,1
664,20,1Fam,1Story,8,5,2005,GasA,Y,2097,1,1
185,75,1Fam,2.5Fin,10,9,1892,GasA,Y,3608,4,1
906,20,1Fam,1Story,8,5,2006,GasA,Y,1636,3,1
359,60,1Fam,2Story,8,5,1998,GasA,Y,1924,3,1
661,60,1Fam,2Story,8,7,1994,GasA,Y,2448,4,1


## Take a look at the features

In [None]:
df.describe()

Unnamed: 0,MSSubClass,OverallQual,OverallCond,YearBuilt,GrLivArea,BedroomAbvGr,Target
count,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0
mean,56.89726,6.099315,5.575342,1971.267808,1515.463699,2.866438,0.14863
std,42.300571,1.382997,1.112799,30.202904,525.480383,0.815778,0.355845
min,20.0,1.0,1.0,1872.0,334.0,0.0,0.0
25%,20.0,5.0,5.0,1954.0,1129.5,2.0,0.0
50%,50.0,6.0,5.0,1973.0,1464.0,3.0,0.0
75%,70.0,7.0,6.0,2000.0,1776.75,3.0,0.0
max,190.0,10.0,9.0,2010.0,5642.0,8.0,1.0


In [None]:
df.describe(exclude=np.number)

Unnamed: 0,BldgType,HouseStyle,Heating,CentralAir
count,1460,1460,1460,1460
unique,5,8,6,2
top,1Fam,1Story,GasA,Y
freq,1220,726,1428,1365


In [None]:
df.Target.value_counts()

0    1243
1     217
Name: Target, dtype: int64
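These class counts are what drive the metric choice. As a rough sketch, consider a hypothetical baseline that always predicts the majority class 0: it is useless for finding expensive houses, yet its accuracy looks respectable, while its F1 score for class 1 is zero. The baseline is an illustration only and is not part of the assignment.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Class counts taken from the value_counts() output above.
y_true = np.array([0] * 1243 + [1] * 217)

# Hypothetical baseline that always predicts the majority class 0.
y_majority = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_majority)
# zero_division=0 silences the warning raised when no positives are predicted.
baseline_f1 = f1_score(y_true, y_majority, zero_division=0)

print(f'accuracy: {baseline_accuracy:.3f}, F1: {baseline_f1:.3f}')
```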

# Task (2p)
- Finish the proposed tasks

**Write your conclusion in a Markdown cell**

In [None]:
df.isna().sum().sort_values(ascending=False)

MSSubClass      0
BldgType        0
HouseStyle      0
OverallQual     0
OverallCond     0
YearBuilt       0
Heating         0
CentralAir      0
GrLivArea       0
BedroomAbvGr    0
Target          0
dtype: int64

## Encode CentralAir

In [5]:
# CentralAir has only two values, so map them directly to 0 and 1.
df.CentralAir = df.CentralAir.map({'N':0,'Y':1})

In [7]:
df.CentralAir

0       1
1       1
2       1
3       1
4       1
       ..
1455    1
1456    1
1457    1
1458    1
1459    1
Name: CentralAir, Length: 1460, dtype: int64

## Encode BldgType

In [8]:
df = pd.concat([df,pd.get_dummies(df.BldgType, prefix='BldgType')],axis=1).drop('BldgType',axis=1)

In [10]:
df.head()

Unnamed: 0,MSSubClass,HouseStyle,OverallQual,OverallCond,YearBuilt,Heating,CentralAir,GrLivArea,BedroomAbvGr,Target,BldgType_1Fam,BldgType_2fmCon,BldgType_Duplex,BldgType_Twnhs,BldgType_TwnhsE
0,60,2Story,7,5,2003,GasA,1,1710,3,0,1,0,0,0,0
1,20,1Story,6,8,1976,GasA,1,1262,3,0,1,0,0,0,0
2,60,2Story,7,5,2001,GasA,1,1786,3,0,1,0,0,0,0
3,70,2Story,7,5,1915,GasA,1,1717,3,0,1,0,0,0,0
4,60,2Story,8,5,2000,GasA,1,2198,4,0,1,0,0,0,0


In [11]:
df.describe(exclude=np.number)

Unnamed: 0,HouseStyle,Heating
count,1460,1460
unique,8,6
top,1Story,GasA
freq,726,1428


## Encode HouseStyle

In [12]:
sorted(df.HouseStyle.unique())

['1.5Fin', '1.5Unf', '1Story', '2.5Fin', '2.5Unf', '2Story', 'SFoyer', 'SLvl']

In [13]:
df = pd.concat([df,pd.get_dummies(df.HouseStyle, prefix='HouseStyle')],axis=1).drop('HouseStyle',axis=1)

In [15]:
pd.set_option('display.max_columns', None)
df.head()

Unnamed: 0,MSSubClass,OverallQual,OverallCond,YearBuilt,Heating,CentralAir,GrLivArea,BedroomAbvGr,Target,BldgType_1Fam,BldgType_2fmCon,BldgType_Duplex,BldgType_Twnhs,BldgType_TwnhsE,HouseStyle_1.5Fin,HouseStyle_1.5Unf,HouseStyle_1Story,HouseStyle_2.5Fin,HouseStyle_2.5Unf,HouseStyle_2Story,HouseStyle_SFoyer,HouseStyle_SLvl
0,60,7,5,2003,GasA,1,1710,3,0,1,0,0,0,0,0,0,0,0,0,1,0,0
1,20,6,8,1976,GasA,1,1262,3,0,1,0,0,0,0,0,0,1,0,0,0,0,0
2,60,7,5,2001,GasA,1,1786,3,0,1,0,0,0,0,0,0,0,0,0,1,0,0
3,70,7,5,1915,GasA,1,1717,3,0,1,0,0,0,0,0,0,0,0,0,1,0,0
4,60,8,5,2000,GasA,1,2198,4,0,1,0,0,0,0,0,0,0,0,0,1,0,0


## Encode Heating

In [16]:
sorted(df.Heating.unique())

['Floor', 'GasA', 'GasW', 'Grav', 'OthW', 'Wall']

In [17]:
df = pd.concat([df,pd.get_dummies(df.Heating, prefix='Heating')],axis=1).drop('Heating',axis=1)

In [18]:
pd.set_option('display.max_columns', None)
df.head()

Unnamed: 0,MSSubClass,OverallQual,OverallCond,YearBuilt,CentralAir,GrLivArea,BedroomAbvGr,Target,BldgType_1Fam,BldgType_2fmCon,BldgType_Duplex,BldgType_Twnhs,BldgType_TwnhsE,HouseStyle_1.5Fin,HouseStyle_1.5Unf,HouseStyle_1Story,HouseStyle_2.5Fin,HouseStyle_2.5Unf,HouseStyle_2Story,HouseStyle_SFoyer,HouseStyle_SLvl,Heating_Floor,Heating_GasA,Heating_GasW,Heating_Grav,Heating_OthW,Heating_Wall
0,60,7,5,2003,1,1710,3,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0
1,20,6,8,1976,1,1262,3,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0
2,60,7,5,2001,1,1786,3,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0
3,70,7,5,1915,1,1717,3,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0
4,60,8,5,2000,1,2198,4,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0


## Check the data types

In [19]:
df.dtypes

MSSubClass           int64
OverallQual          int64
OverallCond          int64
YearBuilt            int64
CentralAir           int64
GrLivArea            int64
BedroomAbvGr         int64
Target               int64
BldgType_1Fam        uint8
BldgType_2fmCon      uint8
BldgType_Duplex      uint8
BldgType_Twnhs       uint8
BldgType_TwnhsE      uint8
HouseStyle_1.5Fin    uint8
HouseStyle_1.5Unf    uint8
HouseStyle_1Story    uint8
HouseStyle_2.5Fin    uint8
HouseStyle_2.5Unf    uint8
HouseStyle_2Story    uint8
HouseStyle_SFoyer    uint8
HouseStyle_SLvl      uint8
Heating_Floor        uint8
Heating_GasA         uint8
Heating_GasW         uint8
Heating_Grav         uint8
Heating_OthW         uint8
Heating_Wall         uint8
dtype: object

## Split the data into input and output parts

In [21]:
X, y = df.loc[:, df.columns != 'Target'], df.loc[:, 'Target']

In [22]:
X.head()

Unnamed: 0,MSSubClass,OverallQual,OverallCond,YearBuilt,CentralAir,GrLivArea,BedroomAbvGr,BldgType_1Fam,BldgType_2fmCon,BldgType_Duplex,BldgType_Twnhs,BldgType_TwnhsE,HouseStyle_1.5Fin,HouseStyle_1.5Unf,HouseStyle_1Story,HouseStyle_2.5Fin,HouseStyle_2.5Unf,HouseStyle_2Story,HouseStyle_SFoyer,HouseStyle_SLvl,Heating_Floor,Heating_GasA,Heating_GasW,Heating_Grav,Heating_OthW,Heating_Wall
0,60,7,5,2003,1,1710,3,1,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0
1,20,6,8,1976,1,1262,3,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0
2,60,7,5,2001,1,1786,3,1,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0
3,70,7,5,1915,1,1717,3,1,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0
4,60,8,5,2000,1,2198,4,1,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0


In [23]:
y.head()

0    0
1    0
2    0
3    0
4    0
Name: Target, dtype: int64

## DecisionTreeClassifier

In [24]:
skf = StratifiedKFold(n_splits=5)
scores = list()
for train_index, test_index in skf.split(X, y):
    X_train, X_test = X.iloc[train_index, :], X.iloc[test_index, :]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    clf = DecisionTreeClassifier()
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    scores.append(f1_score(y_test, y_pred))
    print(f'Ratio in train set: {y_train.value_counts(normalize=True)[1]:.2}; Ratio in test set: {y_test.value_counts(normalize=True)[1]:.2}')
    
scores

Ratio in train set: 0.15; Ratio in test set: 0.15
Ratio in train set: 0.15; Ratio in test set: 0.15
Ratio in train set: 0.15; Ratio in test set: 0.15
Ratio in train set: 0.15; Ratio in test set: 0.15
Ratio in train set: 0.15; Ratio in test set: 0.15


[0.6521739130434783,
 0.7764705882352941,
 0.6265060240963854,
 0.7032967032967034,
 0.6666666666666667]

In [25]:
np.mean(scores), np.min(scores), np.max(scores)

(0.6850227790677057, 0.6265060240963854, 0.7764705882352941)
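The manual loop above can likely be written more compactly with scikit-learn's `cross_val_score`. The sketch below is a possible alternative, not the notebook's method: it runs on synthetic data from `make_classification` (an assumption, with roughly 15% positives to resemble the class ratio here), and the `shuffle` and `random_state` settings are added only to make the run repeatable.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X and y (assumption: roughly 15% positives).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.85, 0.15], random_state=0
)

# Stratified 5-fold CV, scored with F1 for the positive class.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
demo_scores = cross_val_score(
    DecisionTreeClassifier(random_state=0), X_demo, y_demo, cv=cv, scoring='f1'
)

print(demo_scores.mean(), demo_scores.min(), demo_scores.max())
```

In the notebook itself, `X_demo` and `y_demo` would be replaced by the real `X` and `y`.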

## RandomForestClassifier

In [26]:
skf = StratifiedKFold(n_splits=5)
scores = list()
for train_index, test_index in skf.split(X, y):
    X_train, X_test = X.iloc[train_index, :], X.iloc[test_index, :]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    clf = RandomForestClassifier()
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    scores.append(f1_score(y_test, y_pred))
    print(f'Ratio in train set: {y_train.value_counts(normalize=True)[1]:.2}; Ratio in test set: {y_test.value_counts(normalize=True)[1]:.2}')
    
scores

Ratio in train set: 0.15; Ratio in test set: 0.15
Ratio in train set: 0.15; Ratio in test set: 0.15
Ratio in train set: 0.15; Ratio in test set: 0.15
Ratio in train set: 0.15; Ratio in test set: 0.15
Ratio in train set: 0.15; Ratio in test set: 0.15


[0.7209302325581395,
 0.8333333333333333,
 0.6944444444444445,
 0.8148148148148148,
 0.7073170731707317]

In [27]:
np.mean(scores), np.min(scores), np.max(scores)

(0.7541679796642928, 0.6944444444444445, 0.8333333333333333)
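Task 7 asks for five hyper-parameter experiments, each evaluated with 5-fold CV. One way to organise them is sketched below. The five settings are hypothetical and not tuned, and the data is a synthetic stand-in generated with `make_classification`, so the printed scores say nothing about the house price data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for X and y (assumption: roughly 15% positives).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, weights=[0.85, 0.15], random_state=0
)

# Five hypothetical settings; the values are illustrative, not tuned.
settings = [
    {'n_estimators': 50, 'max_depth': None},
    {'n_estimators': 50, 'max_depth': 5},
    {'n_estimators': 100, 'max_depth': 5},
    {'n_estimators': 100, 'max_depth': 10},
    {'n_estimators': 100, 'max_depth': 10, 'class_weight': 'balanced'},
]

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
results = []
for params in settings:
    model = RandomForestClassifier(random_state=0, **params)
    fold_scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring='f1')
    results.append((params, fold_scores.mean()))

# List the experiments from best to worst mean F1.
for params, mean_f1 in sorted(results, key=lambda item: item[1], reverse=True):
    print(f'{mean_f1:.3f}  {params}')
```

Fixing `random_state` in both the splitter and the model keeps the comparison between settings fair, since every experiment sees the same folds.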