Data Mining: Basic Concepts - WS'19/20 
---------------
``` 
> University of Konstanz 
> Department of Computer and Information Science
> Dr. Johannes Fuchs, Eren Cakmak, Frederik Dennig
```

---

#### Exercise 1:  Evaluation of Classifiers - Confusion Matrix

As a first start to evaluate the results of classifiers, a confusion matrix can be calculated on the test data. Through it, it is possible to find classes on which the classifier performs poorly. 

Use only the following imports. 

In [1]:
import pandas as pd
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn import tree
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

##### __(a) Load the “adult.data” file as training and the “adult.test” file as test data. Both can be found in Ilias.__

In [2]:
# Headers are extracted from the documentation
header_names = ['age', 'workclass', 'fnlwgt', 'education',
                'education-num', 'marital', 'occupation',
                'relationship', 'race', 'sex', 'capital-gain',
                'capital-loss', 'hourspweek', 'country',
                'class']

adult_train = pd.read_csv('Data/ass05_train.csv',
                          index_col=False,
                          header=None,
                          names=header_names)
adult_train.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital,occupation,relationship,race,sex,capital-gain,capital-loss,hourspweek,country,class
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [3]:
adult_test = pd.read_csv('Data/ass05_test.csv',
                         index_col=False,
                         header=None,
                         names=header_names)
adult_test.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital,occupation,relationship,race,sex,capital-gain,capital-loss,hourspweek,country,class
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K.
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K.
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K.
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K.
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K.


##### __(b) Preprocess and clean the data. Do not drop any columns. Convert the dataset into a format which can be inputed into a classifier.__
_(Hint: For example, drop `NAN`, transform continues variables to numeric, normalization etc. )_

In [4]:
# Preprocessing needs to be performed on the whole dataset
# (at least since min/max-normalization is used)
df = adult_train.append(adult_test, ignore_index=True)
df

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital,occupation,relationship,race,sex,capital-gain,capital-loss,hourspweek,country,class
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48837,39,Private,215419,Bachelors,13,Divorced,Prof-specialty,Not-in-family,White,Female,0,0,36,United-States,<=50K.
48838,64,?,321403,HS-grad,9,Widowed,?,Other-relative,Black,Male,0,0,40,United-States,<=50K.
48839,38,Private,374983,Bachelors,13,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,50,United-States,<=50K.
48840,44,Private,83891,Bachelors,13,Divorced,Adm-clerical,Own-child,Asian-Pac-Islander,Male,5455,0,40,United-States,<=50K.


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
age              48842 non-null int64
workclass        48842 non-null object
fnlwgt           48842 non-null int64
education        48842 non-null object
education-num    48842 non-null int64
marital          48842 non-null object
occupation       48842 non-null object
relationship     48842 non-null object
race             48842 non-null object
sex              48842 non-null object
capital-gain     48842 non-null int64
capital-loss     48842 non-null int64
hourspweek       48842 non-null int64
country          48842 non-null object
class            48842 non-null object
dtypes: int64(6), object(9)
memory usage: 5.6+ MB


### Preprocessing Training Data
> In the following step, I will preprocess the merged data.  
> I will check for missing values and drop these (if not too many)  
> and afterwards normalize all numeric columns.  
> In the next step, all categorical variables will be turned into dummies.

In [6]:
## Numeric Variables
#
# AGE,
# FNLWGT,
# EDUCATION-NUM,
# CAPITAL-GAIN,
# CAPITAL-LOSS, and
# HOURSPWEEK
#   Are already numeric with no missings, need to be normalized


## Categorical / String Variables
#
#   Change '?' to real missing values
df.loc[(df['workclass'] == ' ?'), 'workclass'] = np.nan
df.loc[(df['occupation'] == ' ?'), 'occupation'] = np.nan
df.loc[(df['country'] == ' ?'), 'country'] = np.nan

#   Create dummies for binary features 
#    Sex / Female (for easier understanding)
df.rename(columns = {'sex': 'female'}, inplace = True)
df.loc[(df['female'] == ' Male'), 'female'] = 0
df.loc[(df['female'] == ' Female'), 'female'] = 1

#    Class
df.loc[(df['class'] == ' <=50K'), 'class'] = 0
df.loc[(df['class'] == ' <=50K.'), 'class'] = 0
df.loc[(df['class'] == ' >50K'), 'class'] = 1
df.loc[(df['class'] == ' >50K.'), 'class'] = 1

In [7]:
# Calculate number of NAN-rows and potential information loss
nan_rows = len(df[df.isna().any(axis=1)])
percentage = (nan_rows/df.shape[0])*100
remaining_rows = len(df[df.notna().all(axis=1)])

print('There are {0} rows with at least one missing value in them.'.format(nan_rows))
print('That equals {0:.2f}%. Therefore, {1} rows are kept.'.format(percentage, remaining_rows))

There are 3620 rows with at least one missing value in them.
That equals 7.41%. Therefore, 45222 rows are kept.


In [8]:
# Drop all rows with at least one NAN in it
df.dropna(axis='index', how='any', inplace=True)
print('{} rows are kept.'.format(len(df)))

45222 rows are kept.


In [9]:
## Create dummies from the remaining seven categorical features
# List of categorical variables which need to be encoded
categorical = ['workclass', 'education', 'marital', 'occupation', 'relationship', 'race', 'country']

# Get dummies and drop original feature (now obsolete)
for feature in categorical:
    df = pd.concat([df, pd.get_dummies(df[feature])], axis=1)
    df.drop(columns=[feature], inplace=True)

In [10]:
# Method for normalizing each numeric column in a dataframe (linear - min/max)
def normalize_df(df):
    cols = list(df)
    for col in cols:
        if df[col].dtypes == np.float64 or df[col].dtypes == np.int64:
            col_min = min(df[col])
            col_max = max(df[col])
            df[col] = (df[col] - col_min) / (col_max - col_min)
    return(df)

normalize_df(df)

df.head()

Unnamed: 0,age,fnlwgt,education-num,female,capital-gain,capital-loss,hourspweek,class,Federal-gov,Local-gov,...,Portugal,Puerto-Rico,Scotland,South,Taiwan,Thailand,Trinadad&Tobago,United-States,Vietnam,Yugoslavia
0,0.30137,0.04335,0.8,0.0,0.02174,0.0,0.397959,0.0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,0.452055,0.047274,0.8,0.0,0.0,0.0,0.122449,0.0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,0.287671,0.136877,0.533333,0.0,0.0,0.0,0.397959,0.0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,0.493151,0.149792,0.4,0.0,0.0,0.0,0.397959,0.0,0,0,...,0,0,0,0,0,0,0,1,0,0
4,0.150685,0.219998,0.8,1.0,0.0,0.0,0.397959,0.0,0,0,...,0,0,0,0,0,0,0,0,0,0


##### __(c) Train a naive bayes, decision tree, k-nearest neigbors on the training data. Calculate the accuracy of each classifier on the test data.__

In [11]:
df

Unnamed: 0,age,fnlwgt,education-num,female,capital-gain,capital-loss,hourspweek,class,Federal-gov,Local-gov,...,Portugal,Puerto-Rico,Scotland,South,Taiwan,Thailand,Trinadad&Tobago,United-States,Vietnam,Yugoslavia
0,0.301370,0.043350,0.800000,0.0,0.021740,0.0,0.397959,0.0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,0.452055,0.047274,0.800000,0.0,0.000000,0.0,0.122449,0.0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,0.287671,0.136877,0.533333,0.0,0.000000,0.0,0.397959,0.0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,0.493151,0.149792,0.400000,0.0,0.000000,0.0,0.397959,0.0,0,0,...,0,0,0,0,0,0,0,1,0,0
4,0.150685,0.219998,0.800000,1.0,0.000000,0.0,0.397959,0.0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48836,0.219178,0.156895,0.800000,0.0,0.000000,0.0,0.397959,0.0,0,0,...,0,0,0,0,0,0,0,1,0,0
48837,0.301370,0.136723,0.800000,1.0,0.000000,0.0,0.357143,0.0,0,0,...,0,0,0,0,0,0,0,1,0,0
48839,0.287671,0.244762,0.800000,0.0,0.000000,0.0,0.500000,0.0,0,0,...,0,0,0,0,0,0,0,1,0,0
48840,0.369863,0.047666,0.800000,0.0,0.054551,0.0,0.397959,0.0,0,0,...,0,0,0,0,0,0,0,1,0,0


In [12]:
# At first, I have to separate training and testing data again
adult_train = df[df.index < len(adult_train)]
adult_test = df[df.index >= len(adult_train)]

In [13]:
# Extract the Data from the DataFrames
Y = adult_train['class']
X = adult_train.drop(columns=['class'])

test_Y = adult_test['class']
test_X = adult_test.drop(columns=['class'])


# Construct and Fit GNB Classifier
gnb = GaussianNB()
gnb = gnb.fit(X, Y)
# Create Mean Score for Training Data (-> Overfitting?)
training_score_gnb = gnb.score(X, Y)
# Predict Testing Classes
gnb_predict = GaussianNB.predict(gnb, test_X)
# Create Accuracy Score for Testing
gnb_acc_score = accuracy_score(test_Y, gnb_predict)
# Print Results
print(
    '''Naive Bayes
    Acc on training:  {0:.5f}
    Acc on testing:   {1:.5f}\n'''.format(training_score_gnb,
                                          gnb_acc_score))


# Construct and Fit Decision Tree Classifier
dtc = tree.DecisionTreeClassifier(random_state=1)
dtc = dtc.fit(X, Y)
# Create Mean Score for Training Data (-> Overfitting?)
training_score_dtc = dtc.score(X, Y)
# Predict Testing Classes
dtc_predict = tree.DecisionTreeClassifier.predict(dtc, test_X)
# Create Accuracy Score for Testing
dtc_acc_score = accuracy_score(test_Y, dtc_predict)
# Print Results
print(
    '''Decision Tree
    Acc on training:  {0:.5f}
    Acc on testing:   {1:.5f}\n'''.format(training_score_dtc,
                                          dtc_acc_score))


## knn takes some time, be patient
# Construct and Fit k-nearest Neighbours Classifier
knn = KNeighborsClassifier(n_neighbors=2)
knn = knn.fit(X, Y)
# Create Score for Training Data (-> Overfitting?)
training_score_knn = knn.score(X, Y)
# Predict Testing Classes
knn_predict = KNeighborsClassifier.predict(knn, test_X)
# Create Accuracy Score for Testing
knn_acc_score = accuracy_score(test_Y, knn_predict)
# Print Results
print(
    '''k-nearest Neighbour
    Acc on training:  {0:.5f}
    Acc on testing:   {1:.5f}\n'''.format(training_score_knn,
                                          knn_acc_score))

Naive Bayes
    Acc on training:  0.55941
    Acc on testing:   0.55560

Decision Tree
    Acc on training:  0.99997
    Acc on testing:   0.83030

k-nearest Neighbour
    Acc on training:  0.89079
    Acc on testing:   0.81287



##### __(d)  Write a method to compute a confusion matrix of a model for binary classification. The parameters of the method should be `model` `X_test` and `y_test`. The method should return the confusion matrix as a 2d `numpy.array`.

In [14]:
def confusion_matrix(model, X_test, Y_test):
    # Predict class with model and test data
    Y_predict = model.predict(X_test)
    
    # Set up counters for the four options (binary classification)
    tp = 0
    fp = 0
    tn = 0
    fn = 0
    
    # Iterate over prediction and real class
    # increase counter of the confusion_matrix fields if case found
    for pred, true in zip(Y_predict, Y_test):
        if (pred==1) & (true==1):
            tp += 1
        elif (pred==1) & (true==0):
            fp += 1
        elif (pred==0) & (true==1):
            fn += 1
        elif (pred==0) & (true==0):
            tn += 1

    return(np.array([[tn,fp],[fn,tp]]))

# Test method of previously trained k-nearest Neighbour classifier
dtc_confusion_matrix = confusion_matrix(dtc, test_X, test_Y)
print(dtc_confusion_matrix)

[[11451  1539]
 [ 1391  2885]]


#### Exercise 2: Evaluation of Classifiers - Precision, Recall and F-Score

##### __(a) Implement a method called Precision which takes as an input a confusion matrix (2d `numpy.array`). Next, the method should calculate the formula for the precision:__

$$Precision = \frac{true\ positive}{true\ positive + false\ positive}$$

In [15]:
def precision(matrix):
    tn = matrix[0,0]
    fp = matrix[0,1]
    fn = matrix[1,0]
    tp = matrix[1,1]
    return(tp / (tp + fp))

precision(dtc_confusion_matrix)

0.652124773960217

##### __(b) Implement a method called Recall which takes as an input a confusion matrix (2d `numpy.array`). Next, the method should calculate the formula for the recall:__

$$Recall = \frac{true\ positive}{true\ positive + false\ negative}$$

In [16]:
def recall(matrix):
    tn = matrix[0,0]
    fp = matrix[0,1]
    fn = matrix[1,0]
    tp = matrix[1,1]
    return(tp / (tp + fn))

recall(dtc_confusion_matrix)

0.6746959775491114

##### __(c) Implement a method called F-Score which takes as input the precision and the recall. Next, the method should calculate the formula for the F-Score:__

$$FScore = 2\cdot\frac{Precision \cdot Recall}{Precision + Recall}$$

In [17]:
def f1score(precision, recall):
    return(2 * ((precision * recall) / (precision + recall)))

# using the before defined methods
f1score(precision(dtc_confusion_matrix), recall(dtc_confusion_matrix)) 

0.6632183908045978

##### __(d) Use the three methods (precision, recall, f1score) to recheck your accuracy of the classifiers of Ex. 1 (naive bayes, decision tree, k-nearest neigbors). Do you see some surprising results?__

In [18]:
conf_matrix_gnb = confusion_matrix(gnb, test_X, test_Y)
conf_matrix_dtc = confusion_matrix(dtc, test_X, test_Y)
conf_matrix_knn = confusion_matrix(knn, test_X, test_Y)

precision_gnb = precision(conf_matrix_gnb)
precision_dtc = precision(conf_matrix_dtc)
precision_knn = precision(conf_matrix_knn)

recall_gnb = recall(conf_matrix_gnb)
recall_dtc = recall(conf_matrix_dtc)
recall_knn = recall(conf_matrix_knn)

f1score_gnb = f1score(precision_gnb, recall_gnb)
f1score_dtc = f1score(precision_dtc, recall_dtc)
f1score_knn = f1score(precision_knn, recall_knn)

print(
    '''Precision
    Naive Bayes:       {0:.4f}
    Decision Tree:     {1:.4f}
    k-near. Neighbour: {2:.4f}\n'''.format(precision_gnb,
                                           precision_dtc,
                                           precision_knn))

print(
    '''Recall
    Naive Bayes:       {0:.4f}
    Decision Tree:     {1:.4f}
    k-near. Neighbour: {2:.4f}\n'''.format(recall_gnb,
                                           recall_dtc,
                                           recall_knn))

print(
    '''F1-Score
    Naive Bayes:       {0:.4f}
    Decision Tree:     {1:.4f}
    k-near. Neighbour: {2:.4f}\n'''.format(f1score_gnb,
                                           f1score_dtc,
                                           f1score_knn))

Precision
    Naive Bayes:       0.3511
    Decision Tree:     0.6521
    k-near. Neighbour: 0.7198

Recall
    Naive Bayes:       0.9366
    Decision Tree:     0.6747
    k-near. Neighbour: 0.4001

F1-Score
    Naive Bayes:       0.5107
    Decision Tree:     0.6632
    k-near. Neighbour: 0.5144



> **Interpretation**  
> Results are taken from the table below.  
>
> Looking at the _Test Accuracy_ only might lead to the assumption,  
> that the k-nearest Neighbours-classification and the decision tree's  
> is the best while the naive bayes classifier should not be used.  
>
> Adding the _Accuracy on Training_ data to that, one can see how badly  
> the unpruned decision tree is overfitting. But this alone does not  
> grant the full insight into classifier estimation.  
>
> _Precision_ measures, how many of the positively predicted cases are  
> actually positive. In this, k-nearest Neighbours delivers the best results  
> while the naive bayes estimator does not perform as good.  
>
> In contrast to that, the _Recall_ is highest in naives bayes whereas  
> knn performs poorly. Therefore, the naive bayes performs best when  
> false-negatives should be avoided; if the income of a data point  
> actually is above 50k, there is a very high chance of naive bayes  
> detecting it correctly.  
>  
> KNN on the other hand might be best suitable if false-positives shall  
> be avoided; so when a prediction is positive, it has the highest chance  
> for the income actually being above 50k.  
>
> The _F1-Score_ balances _Precision_ and _Recall_ and might work as a further  
> addition to the estimators, espacially if the class distribution is very uneven.  
> In this dataset, the training as well as the testing data have a positive class-  
> proportion of about 25% each.  
> In this measurement, the decision tree seems to be the best classifier, maybe  
> balancing _precision_ and _recall_ the best.  
>
> **To put it in a nutshell, this shows that the evaluation of a classifier  
> should rely on more than a single measurement and furthermore  
> depend on what the classifier will be used for.**

|                      | Naive Bayes | Decision Tree | k-near. N |
|:-------------------- | :---------: | :-----------: | :-------: |
| **Accuracy (Test)**  | 0.5556      | 0.8303        | 0.8128    |
| **Accuracy (Train)** | 0.5594      | 0.9999        | 0.8908    |
|                      |             |               |           |
| **Precision**        | 0.3511      | 0.6521        | 0.7198    |
| **Recall**           | 0.9366      | 0.6747        | 0.4001    |
| **F1-Score**         | 0.5144      | 0.6632        | 0.5144    |