# Practice Choosing Models

## Setup

In [None]:
from sklearn.model_selection import train_test_split, KFold, cross_val_score # to split the data
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, fbeta_score #To evaluate our model
import pandas as pd
import numpy as np

from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler

# Algorithms to be compared
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.datasets import load_breast_cancer

import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('german_credit_data.csv', index_col=0)
data = load_breast_cancer()
df2 = pd.DataFrame(data.data, columns=data.feature_names)
df2['target'] = data.target

## Intro

In the german_credit_data dataset, we are trying to predict whether an individual is creditworthy by classifying each entry as good or bad in the 'Risk' column.

In the second dataset, we are trying to predict whether a patient has breast cancer, which is labeled in the 'target' column.

## Testing Methods

In [None]:
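def test_dataset(test_val, df):
    """Fit each candidate model on a 70/30 split of df and print its test accuracy.

    test_val is the name of the target column in df.
    """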
    # Split dataset into training and testing sets
    X = df.drop(test_val, axis=1)
    y = df[test_val]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    # Scale the features using StandardScaler
    sc = StandardScaler()
    X_train = sc.fit_transform(X_train)
    X_test = sc.transform(X_test)

    models = []
    models.append(('LR', LogisticRegression()))
    models.append(('LDA', LinearDiscriminantAnalysis()))
    models.append(('KNN', KNeighborsClassifier()))
    models.append(('CART', DecisionTreeClassifier()))
    models.append(('NB', GaussianNB()))
    models.append(('RF', RandomForestClassifier()))
    models.append(('SVM', SVC(gamma='auto')))

    # Evaluate each model in turn
    results = []

    for name, model in models:
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        results.append(y_pred)
        msg = "%s: %f" % (name, accuracy_score(y_test, y_pred))
        print(msg)
    print("\n")
    return [y_test, results]  # True test labels and each model's predictions, in the order of `models`

## Task 1) Data Overview

Take a look at both datasets. What values are you working with? Also think about which models could be suitable for each dataset.
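
One possible starting point for the overview is sketched below for the breast cancer data, which is rebuilt here the same way as in the Setup cell so the snippet runs on its own. The helper name `overview` is hypothetical and not part of the notebook. Note that in scikit-learn's copy of this dataset, a target of 0 means malignant and 1 means benign.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Rebuild the breast cancer frame the same way the Setup cell does
data = load_breast_cancer()
df2 = pd.DataFrame(data.data, columns=data.feature_names)
df2['target'] = data.target


def overview(frame, target):
    """Collect a few basic facts about a DataFrame (hypothetical helper)."""
    return {
        'shape': frame.shape,                                     # (rows, columns)
        'dtypes': frame.dtypes.value_counts().to_dict(),          # how many columns per dtype
        'missing': int(frame.isna().sum().sum()),                 # total number of NaN cells
        'class_counts': frame[target].value_counts().to_dict(),   # class balance of the target
    }


summary = overview(df2, 'target')
print(summary)
print(df2.describe().T.head())  # value ranges of the first few features
```

The same helper could be applied to the credit data with `overview(df, 'Risk')`, assuming the CSV has been loaded as in the Setup cell.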

## Task 2) Data Cleanup

Try to replace all missing values in the german_credit_data dataset with the mean of the corresponding column, and convert all non-numeric values into numeric values. Can you also come up with a new feature to improve your result?
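
A minimal sketch of one way to approach the cleanup is shown below. Because the CSV is not bundled with this text, it uses a small toy frame: every column name except 'Risk', all values, and the helper name `clean` are assumptions made for illustration. The engineered feature is only one example of what could be tried.

```python
import numpy as np
import pandas as pd

# Toy stand-in for german_credit_data.csv; column names and values are assumptions
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 45.0, 31.0],
    'Housing': ['own', 'rent', 'own', 'free'],
    'Credit amount': [1000.0, 2500.0, np.nan, 4000.0],
    'Duration': [12, 24, 36, 48],
    'Risk': ['good', 'bad', 'good', 'bad'],
})


def clean(frame, target='Risk'):
    """Impute, encode, and add one example feature (hypothetical helper)."""
    out = frame.copy()

    # Fill missing numeric values with the mean of their column
    num_cols = out.select_dtypes(include='number').columns
    out[num_cols] = out[num_cols].fillna(out[num_cols].mean())

    # Encode the target as 0 (bad) / 1 (good)
    out[target] = out[target].map({'bad': 0, 'good': 1})

    # One-hot encode the remaining non-numeric columns
    cat_cols = list(out.select_dtypes(exclude='number').columns)
    out = pd.get_dummies(out, columns=cat_cols, dtype=int)

    # Example engineered feature: credit amount per unit of duration
    out['Amount per month'] = out['Credit amount'] / out['Duration']
    return out


cleaned = clean(toy)
print(cleaned)
```

A mean is only defined for numeric columns. If missing values turn out to sit in non-numeric columns, one option is to fill them with a separate category such as 'unknown' before encoding.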

## Task 3) Find a Suitable Model

Run each dataset through the given test_dataset function and interpret the results. Which machine learning model is best suited to each problem?
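
The accuracy from a single train/test split can shift with the split. As a complement to test_dataset, and not a replacement for it, the sketch below uses `KFold` and `cross_val_score` (both imported in the Setup cell) on the breast cancer data. The choice of two models, five folds, `max_iter=1000`, and the use of `make_pipeline` are assumptions made to keep the example short.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Two of the candidate models, chosen only to keep the example short
candidates = {
    'LR': LogisticRegression(max_iter=1000),
    'KNN': KNeighborsClassifier(),
}

kfold = KFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = {}
for name, model in candidates.items():
    # Scaling lives inside the pipeline, so each fold is scaled using only its training part
    pipe = make_pipeline(StandardScaler(), model)
    cv_scores[name] = cross_val_score(pipe, X, y, cv=kfold, scoring='accuracy')
    print("%s: %.3f (+/- %.3f)" % (name, cv_scores[name].mean(), cv_scores[name].std()))
```

Comparing the mean and the spread across folds gives a rough idea of whether a difference between two models is larger than the variation between splits.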

## Task 4) Confusion Matrix

Output a confusion matrix for the models you have chosen and interpret the results. Why does one combination of model and dataset perform better than another?
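
As a rough illustration, the sketch below builds a confusion matrix for a single model on the breast cancer data, using the same split and scaling settings as test_dataset. Picking logistic regression and setting `max_iter=1000` are assumptions for the example, not a recommendation.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit the scaler on the training data only, then apply it to the test data
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are true classes, columns are predicted classes, in label order 0, 1
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()
print(cm)

# In scikit-learn's copy of this dataset, 0 is malignant and 1 is benign
print(classification_report(y_test, y_pred, target_names=['malignant', 'benign']))
```

Because label 1 is benign here, `fp` counts malignant cases predicted as benign, which is likely the more costly error for this problem. With the return value of test_dataset, the same call can be made as `confusion_matrix(y_test, results[i])` for the i-th model in the list.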