# Exercise 4 - Model Evaluation
Today we will find out how good our models really are. We tackle one classification problem (Titanic: Decision Tree, Random Forest and XGBoost) and one regression problem (house prices: Model 1, Model 2 and Model 3) and compute metrics that quantify how well these models perform. We will also create baselines, so that we have a reference point for how good our models are compared with something other than ML.
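
As a rough illustration of what such a baseline could look like for the classification task, the sketch below always predicts the most common class and measures the resulting accuracy. The labels are invented toy data, not the Titanic set, and `majority_baseline` is a hypothetical helper name; scikit-learn's `DummyClassifier(strategy="most_frequent")` implements the same idea.

```python
from collections import Counter

def majority_baseline(y_train, y_test):
    """Predict the most frequent training label for every test row."""
    most_common_label = Counter(y_train).most_common(1)[0][0]
    predictions = [most_common_label] * len(y_test)
    accuracy = sum(p == t for p, t in zip(predictions, y_test)) / len(y_test)
    return most_common_label, accuracy

# Toy labels (assumption, not real data): 0 = died, 1 = survived
y_train_toy = [0, 0, 0, 1, 1, 0, 1, 0]
y_test_toy = [0, 1, 0, 0, 1]

label, baseline_accuracy = majority_baseline(y_train_toy, y_test_toy)
print(label, baseline_accuracy)
```

A trained model is only interesting if it clearly beats this number. Note that this baseline never predicts the minority class, so its recall for that class is 0 by construction.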

## Classification problem - Titanic
Note: Reuse the code from Exercise 2 to build the models that predict who survives the Titanic.
1. Load, clean and split the training set titanic.csv.
2. Create and train four classification models (Decision Tree, Random Forest, XGBoost, SVM).
3. For each model, compute the metrics:
    - Accuracy
    - Precision
    - Recall
    - F1, F2, F0.5
4. Which model performs best? Does the ranking of the models differ from Exercise 2, where we only looked at Accuracy?

In [18]:
import pandas as pd
from sklearn.model_selection import train_test_split
df = pd.read_csv('titanic.csv')

# A handy helper for encoding a column of single-character values as integer ranks (sorted alphabetically).
# Henrik splits Cabin into CabinSection and CabinNumber
# (based on the leading letter and the digits after it).
def dictionary_function(df, col):
    my_value_list = sorted(list(set([item[0] for item in list(set(list(df[col].values)))])))
    my_ranking_list = list(range(len(my_value_list)))
    my_dictionary = {}
    for x,y in zip(my_value_list, my_ranking_list):
        my_dictionary[x] = y
    df.replace({col: my_dictionary}, inplace=True)
    return df

# Drop leftover index columns, if present.
dumb_cols = ['Unnamed: 0', 'Unnamed: 0.1']
df = df.drop(columns=dumb_cols, errors='ignore')

# The Survived column has a lot of null values. We have to drop those rows: not fun, but they carry no label to learn from.
df = df[df['Survived'].notna()].copy()

# Survived is stored as strings right now. Convert it to ints.
df['Survived'] = df['Survived'].replace({'Yes': 1, 'No': 0}).astype(int)

# Fill missing values in Age with the rounded mean age, then replace implausible ages above 120 with the mean as well.
df['Age'] = df['Age'].fillna(round(df['Age'].mean()))
age_mean = df['Age'].mean()
df['Age'] = df['Age'].where(df['Age'] <= 120, age_mean).astype(int)

# Fill missing values in Embarked with 'U' (unknown).
df['Embarked'] = df['Embarked'].fillna('U')

# Fill missing values in Cabin with 'U0'.
df['Cabin'] = df['Cabin'].fillna('U0')

# Encode sex as 1 (male) and 0 (female).
df['Sex'] = df['Sex'].replace({'male':1,'female':0})

# Split Cabin into a section (leading letter) and a cabin number (first run of digits).
# Then use our helper to turn the cabin section and the port of embarkation into integer codes.
df[['CabinSection', 'CabinNr', 'dummy']] = df["Cabin"].str.split(r"(\d+)", n=1, expand=True)
df.CabinSection = df.CabinSection.apply(lambda x: x[0])
df.CabinNr = df.CabinNr.fillna(0).astype(int)
for col in ['CabinSection', 'Embarked']:
    df = dictionary_function(df, col)
    df[col] = df[col].astype(int)

# Drop the helper columns created along the way and the columns we do not use as features.
df.drop(columns=['dummy', 'Cabin', 'Ticket', 'Name', 'PassengerId'], inplace=True)

# Name the target column; every other column is a feature.
target_col = ['Survived']
feature_cols = [col for col in df.columns if col not in target_col]


# Define our target and our features.
y = df[target_col]
X = df[feature_cols]


# Split the dataset into train and test.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

# Scale the features to [0, 1]. The scaler is fitted on the training set only and then applied to both sets.
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler_fit = scaler.fit(X_train)
X_train = pd.DataFrame(scaler_fit.transform(X_train), index=X_train.index, columns=feature_cols)
X_test = pd.DataFrame(scaler_fit.transform(X_test), index=X_test.index, columns=feature_cols)
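
The scaling step above fits the scaler on the training set only and then applies it to both sets. A minimal pure-Python sketch of that idea for a single column follows; the numbers are invented, and `fit_min_max` and `transform_min_max` are hypothetical helper names, not part of scikit-learn.

```python
def fit_min_max(values):
    """Learn the minimum and maximum from the training values only."""
    return min(values), max(values)

def transform_min_max(values, lo, hi):
    """Scale values with previously learned bounds: (x - lo) / (hi - lo)."""
    return [(v - lo) / (hi - lo) for v in values]

train_age = [20, 30, 40, 60]   # invented training ages
test_age = [25, 70]            # invented test ages

lo, hi = fit_min_max(train_age)
train_scaled = transform_min_max(train_age, lo, hi)
test_scaled = transform_min_max(test_age, lo, hi)

print(train_scaled)  # training values land inside [0, 1]
print(test_scaled)   # test values can fall outside [0, 1]
```

Because the bounds come from the training data, a test value larger than the training maximum is mapped above 1. That is expected and preferable to refitting the scaler on the test set, which would leak test information into preprocessing.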

### Model 1: Decision Tree

In [19]:
from sklearn import tree
import matplotlib.pyplot as plt
model = tree.DecisionTreeClassifier() # optimize! Can we do it better?
model.fit(X_train, y_train)
predictions_decision_tree = list(model.predict(X_test))

### Model 2: Random Forest

In [20]:
from sklearn.ensemble import RandomForestClassifier
model_forest = RandomForestClassifier() # optimize! Can we do it better?
model_forest.fit(X_train, y_train.values.ravel()) # pass 1-D labels to avoid sklearn's DataConversionWarning
predictions_random_forest = list(model_forest.predict(X_test))



### Model 3: XGBoost

In [21]:
import xgboost as xgb
model_XGB = xgb.XGBClassifier() # optimize! Can we do it better?
model_XGB.fit(X_train, y_train)
predictions_XGB = list(model_XGB.predict(X_test))

### Model 4: SVM

In [22]:
from sklearn.svm import SVC
model_SVC = SVC() # optimize! Can we do it better?
model_SVC.fit(X_train, y_train.values.ravel()) # pass 1-D labels to avoid sklearn's DataConversionWarning
predictions_SVC = list(model_SVC.predict(X_test))



### Model 5: Neural Network (Classification)

In [35]:
from keras.layers import Dense
from keras.models import Sequential
from keras.optimizers import Adam
from keras.utils import to_categorical

y_train_nn = to_categorical(y_train, num_classes=len(y_train.value_counts()))
y_test_nn = to_categorical(y_test, num_classes=len(y_train.value_counts()))

model_nn = Sequential()

model_nn.add(Dense(200, input_dim=len(X_train.columns), activation='relu')) #input_layer + first hidden layer
model_nn.add(Dense(100, activation='relu'))
model_nn.add(Dense(50, activation='relu'))
model_nn.add(Dense(25, activation='relu'))
model_nn.add(Dense(12, activation='relu'))
model_nn.add(Dense(len(y_train.value_counts()), activation='softmax')) # output: one probability per class, matching categorical_crossentropy

model_nn.compile(loss='categorical_crossentropy', optimizer=Adam())

model_nn.fit(X_train, y_train_nn,
             epochs=100,
             verbose=0,
             validation_split=0.2,
             batch_size=32)

predictions_nn = list(model_nn.predict(X_test))

import numpy as np
predictions_nn_nice = np.argmax(predictions_nn, axis=1)


### Model Evaluation - Classification
- Accuracy
- Precision
- Recall
- F1, F2, F0.5
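
To make the metrics above concrete, here is a small sketch that derives accuracy, precision and recall from the four cells of a confusion matrix. The labels are invented toy data and `confusion_counts` is a hypothetical helper; class 1 (survived) is treated as the positive class.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn

# Toy data (assumption): 1 = survived, 0 = died
y_true_toy = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_toy = [1, 0, 0, 1, 0, 1, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true_toy, y_pred_toy)

accuracy = (tp + tn) / (tp + fp + fn + tn)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # share of predicted survivors who really survived
recall = tp / (tp + fn)                     # share of real survivors the model found

print(tp, fp, fn, tn)
print(accuracy, precision, recall)
```

Precision and recall look only at the positive class, which is why they can tell two models apart even when their accuracy is nearly the same.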

In [41]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score
our_prediction_list = [predictions_decision_tree, predictions_random_forest, predictions_XGB, predictions_SVC, predictions_nn_nice]
model_list = ['Decision Tree', 'Random Forest', 'XGB', 'SVC', 'NN']
utfall = y_test['Survived'].to_list() # "utfall" = the actual outcomes

for preds, modl in zip(our_prediction_list, model_list):
    # The labels are encoded as 0 (died) and 1 (survived); the scores refer to the positive class 1.
    precision = precision_score(utfall, preds)
    recall = recall_score(utfall, preds)
    accuracy = accuracy_score(utfall, preds)
    f1 = f1_score(utfall, preds)
    print(modl)
    print(f'Accuracy: {round(accuracy,3)}')
    print(f'Precision: {round(precision, 3)}')
    print(f'Recall: {round(recall, 3)}')
    print(f'F1-score: {round(f1, 3)}')
    print(' ')

Decision Tree
Accuracy: 0.761
Precision: 0.742
Recall: 0.649
F1-score: 0.692
 
Random Forest
Accuracy: 0.802
Precision: 0.809
Recall: 0.685
F1-score: 0.741
 
XGB
Accuracy: 0.772
Precision: 0.745
Recall: 0.685
F1-score: 0.714
 
SVC
Accuracy: 0.795
Precision: 0.85
Recall: 0.613
F1-score: 0.712
 
NN
Accuracy: 0.784
Precision: 0.798
Recall: 0.64
F1-score: 0.71
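
The task also asks for F2 and F0.5, which the loop above does not print. As a rough sketch, they can be approximated from the reported precision P and recall R with the general formula F_beta = (1 + beta^2) * P * R / (beta^2 * P + R). The inputs below are the rounded values from the output above, so the results are approximations only; `f_beta` is a hypothetical helper, and scikit-learn's `fbeta_score` would compute the exact values directly from the predictions.

```python
def f_beta(precision, recall, beta):
    """General F-score: beta > 1 favours recall, beta < 1 favours precision."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# (precision, recall) as printed above, rounded to three decimals
reported = {
    'Decision Tree': (0.742, 0.649),
    'Random Forest': (0.809, 0.685),
    'XGB': (0.745, 0.685),
    'SVC': (0.85, 0.613),
    'NN': (0.798, 0.64),
}

for name, (p, r) in reported.items():
    print(name)
    print(f'F0.5 (approx.): {round(f_beta(p, r, 0.5), 3)}')
    print(f'F1   (approx.): {round(f_beta(p, r, 1), 3)}')
    print(f'F2   (approx.): {round(f_beta(p, r, 2), 3)}')
    print(' ')
```

F2 weights recall more heavily and F0.5 weights precision more heavily, so a model such as SVC, with the highest precision but the lowest recall in the output above, should look comparatively better under F0.5 than under F2.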
 


## Regression problem - House prices
1. Load, clean and split the training set housing.csv.
2. Create and train four regression models (Linear Regression, Random Forest Regressor, XGB Regressor, SVM).
3. For each model, compute the metrics:
    - Mean Squared Error
    - Root Mean Squared Error
    - R2-score
    - Mean Absolute Error
4. Which model performs best?

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
import os
df = pd.read_csv('housing.csv')

# Clean and fix the data
df.drop(columns={'Id'}, inplace=True)
one_hot_columns = ['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig','LandSlope','Neighborhood','Condition1','Condition2','BldgType','HouseStyle',
                'RoofStyle', 'RoofMatl', 'Exterior1st','Exterior2nd','MasVnrType','ExterQual','ExterCond','Foundation','BsmtQual', 'BsmtCond','BsmtExposure','BsmtFinType1','BsmtFinType2',
                 'Heating', 'HeatingQC','CentralAir','Electrical','KitchenQual','Functional','FireplaceQu','GarageType','GarageFinish','GarageQual','GarageCond','PavedDrive','PoolQC','Fence',
                'MiscFeature','SaleType','SaleCondition', 'GarageYrBlt', 'MasVnrArea']
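# Despite its name, one_hot_columns is simply dropped here; no one-hot encoding is performed.
df.drop(columns=one_hot_columns, inplace=True)
df['LotFrontage'] = df['LotFrontage'].fillna(0) # treat a missing lot frontage as 0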

target_col = ['SalePrice']
feature_cols = [col for col in df.columns if col not in target_col]

y = df[target_col]
X = df[feature_cols]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

from sklearn.preprocessing import MinMaxScaler, StandardScaler
scaler = MinMaxScaler()
scaler_fit = scaler.fit(X_train)

scaler_y = MinMaxScaler()
scaler_y_fit = scaler_y.fit(y_train)


X_train = pd.DataFrame(scaler_fit.transform(X_train), index=X_train.index, columns=feature_cols)
X_test = pd.DataFrame(scaler_fit.transform(X_test), index=X_test.index, columns=feature_cols)


#y_train = pd.DataFrame(scaler_y.transform(y_train), index=y_train.index, columns=y_train.columns)
#y_test = pd.DataFrame(scaler_y.transform(y_test), index=y_test.index, columns=y_test.columns)


### Model 1: Linear Regression

In [2]:
from sklearn.linear_model import LinearRegression
model_LinearRegression = LinearRegression()
model_LinearRegression.fit(X_train, y_train)
predictions_LinearRegression = list(model_LinearRegression.predict(X_test))


### Model 2: Random Forest Regressor

In [3]:
from sklearn.ensemble import RandomForestRegressor
model_RandomForestRegressor = RandomForestRegressor()
model_RandomForestRegressor.fit(X_train, y_train.values.ravel()) # pass a 1-D target to avoid sklearn's DataConversionWarning
predictions_RandomForestRegressor = list(model_RandomForestRegressor.predict(X_test))



### Model 3: XGB Regressor

In [4]:
from xgboost import XGBRegressor
model_XGBRegressor = XGBRegressor()
model_XGBRegressor.fit(X_train, y_train)
predictions_XGBRegressor = list(model_XGBRegressor.predict(X_test))

In [8]:
len(X_train.columns)

34

### Model 4: Neural Network (Regression)

In [14]:
from keras.layers import Dense
from keras.models import Sequential
from keras.optimizers import Adam


model_nn = Sequential()

model_nn.add(Dense(200, input_dim=len(X_train.columns), activation='relu')) #input_layer + first hidden layer
model_nn.add(Dense(100, activation='relu'))
model_nn.add(Dense(50, activation='relu'))
model_nn.add(Dense(25, activation='relu'))
model_nn.add(Dense(12, activation='relu'))
model_nn.add(Dense(1, activation='relu')) # single output unit; relu keeps the predicted price non-negative

model_nn.compile(loss='mean_squared_error', optimizer=Adam())

model_nn.fit(X_train, y_train,
             epochs=100,
             verbose=0,
             validation_split=0.2,
             batch_size=124)





In [15]:
predictions_nn = list(model_nn.predict(X_test))



In [16]:
predictions_nn

[array([149618.94], dtype=float32),
 array([279173.75], dtype=float32),
 array([123360.516], dtype=float32),
 array([174868.47], dtype=float32),
 array([261202.44], dtype=float32),
 array([85989.9], dtype=float32),
 array([211697.64], dtype=float32),
 array([172826.16], dtype=float32),
 array([82426.3], dtype=float32),
 array([158349.3], dtype=float32),
 array([139198.19], dtype=float32),
 array([131597.78], dtype=float32),
 array([138856.78], dtype=float32),
 array([194943.67], dtype=float32),
 array([210846.69], dtype=float32),
 array([151063.66], dtype=float32),
 array([217291.27], dtype=float32),
 array([139309.19], dtype=float32),
 array([119104.19], dtype=float32),
 array([211538.89], dtype=float32),
 array([198738.62], dtype=float32),
 array([205404.12], dtype=float32),
 array([206205.62], dtype=float32),
 array([153748.22], dtype=float32),
 array([204675.66], dtype=float32),
 array([159929.78], dtype=float32),
 array([193970.62], dtype=float32),
 array([114584.914], dtype=float32),
 ...]

### Model Evaluation - Regression
- Mean Squared Error
- Root Mean Squared Error
- R2-score
- Mean Absolute Error
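
As a minimal sketch of what the four metrics above measure, the snippet below computes them by hand on invented numbers. `regression_metrics` is a hypothetical helper, not a scikit-learn function.

```python
def regression_metrics(y_true, y_pred):
    """Return (MSE, RMSE, MAE, R2) for two equally long lists of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e ** 2 for e in errors) / n      # average squared error
    rmse = mse ** 0.5                          # same unit as the target
    mae = sum(abs(e) for e in errors) / n      # average absolute error
    mean_true = sum(y_true) / n
    ss_res = sum(e ** 2 for e in errors)       # error left after the model
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # error of predicting the mean
    r2 = 1 - ss_res / ss_tot
    return mse, rmse, mae, r2

# Toy prices (assumption, not from housing.csv)
y_true_toy = [100, 200, 300, 400]
y_pred_toy = [110, 190, 320, 380]

mse, rmse, mae, r2 = regression_metrics(y_true_toy, y_pred_toy)
print(mse, rmse, mae, r2)
```

RMSE and MAE are expressed in the unit of the target (here: price), which makes them easier to interpret than MSE. RMSE penalises large individual errors more strongly than MAE does.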

In [17]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

our_prediction_list = [predictions_LinearRegression, predictions_RandomForestRegressor, predictions_XGBRegressor, predictions_nn]
model_list = ['Linear Regression', 'Random Forest Regressor', 'XGB Regressor', 'NN']
utfall = y_test['SalePrice'].to_list()

for preds, modl in zip(our_prediction_list, model_list):
    MSE = mean_squared_error(utfall, preds)
    RMSE = MSE ** 0.5 # avoids the `squared` argument, which is deprecated in recent scikit-learn releases
    MAE = mean_absolute_error(utfall, preds)
    R2 = r2_score(utfall, preds)
    
    print(modl)
    print(f'Mean Squared Error: {round(MSE,3)}')
    print(f'Root Mean Squared Error: {round(RMSE, 3)}')
    print(f'Mean Absolute Error: {round(MAE, 3)}')
    print(f'R2-score: {round(R2, 3)}')
    print(' ')
    
    

Linear Regression
Mean Squared Error: 1614652313.764
Root Mean Squared Error: 40182.737
Mean Absolute Error: 24296.222
R2-score: 0.78
 
Random Forest Regressor
Mean Squared Error: 1006802120.62
Root Mean Squared Error: 31730.145
Mean Absolute Error: 18097.275
R2-score: 0.863
 
XGB Regressor
Mean Squared Error: 1124300111.258
Root Mean Squared Error: 33530.585
Mean Absolute Error: 18484.458
R2-score: 0.847
 
NN
Mean Squared Error: 2266999115.873
Root Mean Squared Error: 47613.014
Mean Absolute Error: 28049.084
R2-score: 0.691
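
For the regression task, a simple baseline is to predict the mean training price for every house. The sketch below uses invented prices, not housing.csv, and all variable names are hypothetical; scikit-learn's `DummyRegressor(strategy="mean")` implements the same idea.

```python
# Invented prices (assumption), standing in for y_train and y_test
train_prices = [100_000, 150_000, 200_000, 250_000]
test_prices = [120_000, 180_000, 240_000]

# The baseline "model": always predict the mean of the training prices
baseline_prediction = sum(train_prices) / len(train_prices)

errors = [price - baseline_prediction for price in test_prices]
baseline_mae = sum(abs(e) for e in errors) / len(errors)
baseline_rmse = (sum(e ** 2 for e in errors) / len(errors)) ** 0.5

print(f'Baseline prediction: {baseline_prediction}')
print(f'Baseline MAE: {round(baseline_mae, 3)}')
print(f'Baseline RMSE: {round(baseline_rmse, 3)}')
```

Running the same calculation on the real split would give the reference point for the results above: the further the models' MAE (roughly 18,000 to 28,000) and RMSE fall below the baseline values, the more the models add over simply guessing the average price.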
 
