# CS534 Homework 2


Please provide an answer to the following question:

## Question 1 (15 pts)

Implement the AdaBoost algorithm using the scikit-learn implementation of logistic regression as the base learner. On a real dataset, compare the results of a single logistic regression and AdaBoost (use K-fold cross-validation).

These links may be helpful: http://rob.schapire.net/papers/explaining-adaboost.pdf and https://en.wikipedia.org/wiki/AdaBoost
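
For reference, the updates of discrete AdaBoost for labels $y_i \in \{-1, +1\}$ can be written as follows. The symbols are notation chosen here, not taken from the assignment: $w_i^{(t)}$ is the weight of sample $i$ in round $t$, $h_t$ the classifier fitted in that round, $\epsilon_t$ its weighted error, and $\alpha_t$ its vote in the final ensemble $H$.

```latex
\begin{align*}
w_i^{(1)} &= \frac{1}{n}, \\
\epsilon_t &= \sum_{i=1}^{n} w_i^{(t)} \, \mathbf{1}\!\left[h_t(x_i) \neq y_i\right], \\
\alpha_t &= \frac{1}{2} \ln\frac{1 - \epsilon_t}{\epsilon_t}, \\
w_i^{(t+1)} &= \frac{w_i^{(t)} \exp\!\left(-\alpha_t \, y_i \, h_t(x_i)\right)}
                   {\sum_{j=1}^{n} w_j^{(t)} \exp\!\left(-\alpha_t \, y_j \, h_t(x_j)\right)}, \\
H(x) &= \operatorname{sign}\!\left(\sum_{t=1}^{T} \alpha_t \, h_t(x)\right).
\end{align*}
```

Since $y_i h_t(x_i)$ is $+1$ for a correct prediction and $-1$ for a mistake, a round with $\epsilon_t < 1/2$ has $\alpha_t > 0$, so it shrinks the weights of correctly classified samples and grows the weights of the misclassified ones.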


In [1]:
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN, KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold



def ADABoost(X, y, T):
    """Train T weighted logistic regressions; y must take values in {-1, +1}."""
    # start from uniform sample weights
    weights = np.full(len(X), 1 / len(X))
    sigs = []
    models = []

    for t in range(T):
        logisticRegr = LogisticRegression(solver='liblinear')
        logisticRegr.fit(X, y, sample_weight=weights)
        models.append(logisticRegr)
        results = logisticRegr.predict(X)
        # flag the misclassified samples
        wrongInd = (results != y)
        # weighted error = total weight of the misclassified samples
        totalError = (wrongInd * weights).sum()
        # clip so the log stays finite if a round is perfect (or always wrong)
        totalError = np.clip(totalError, 1e-10, 1 - 1e-10)
        significance = .5 * np.log((1 - totalError) / totalError)
        sigs.append(significance)
        # up-weight mistakes, down-weight correct predictions, then normalize
        weights = weights * np.exp(-significance * y * results)
        weights = weights / weights.sum()
    return models, sigs


def ADABoostPredict(M, S, X, y):
    """Return the accuracy of the weighted-vote ensemble (M, S) on (X, y)."""
    # each row holds one model's predictions scaled by its significance
    votes = np.array([s * m.predict(X) for m, s in zip(M, S)])
    # the ensemble label is the sign of the summed votes for each sample
    newY = np.sign(votes.sum(axis=0))
    return accuracy_score(y, newY)



In [2]:
##Testing for the Iris DataSet

data = pd.read_csv('iris.csv',sep=',').values
X=data[:,0:4]
y=data[:,4]

def val(s):
    if s=='Iris-virginica':
        return 1
    return -1
y=np.array([val(x) for x in y])


skf = StratifiedKFold(n_splits=10, shuffle=True)
logAcc = 0
adaAcc = 0

for train_index, test_index in skf.split(X, y):
    X_train=X[train_index]
    X_test=X[test_index]
    y_train=y[train_index]
    y_test=y[test_index]
    logisticRegr = LogisticRegression(solver='liblinear')
    logisticRegr.fit(X_train, y_train)
    results =  logisticRegr.predict(X_test)
    #print(accuracy_score(y_test, results))
    logAcc += accuracy_score(y_test, results)
    
for train_index, test_index in skf.split(X, y):
    X_train=X[train_index]
    X_test=X[test_index]
    y_train=y[train_index]
    y_test=y[test_index]
    M,S = ADABoost(X_train,y_train,1000)
    acc = ADABoostPredict(M,S,X_test,y_test)
    #print(acc)
    adaAcc += acc
    
print("Logistic Regression Average Accuracy:")
print(logAcc/10)
print("AdaBoost Average Accuracy:")
print(adaAcc/10)

Logistic Regression Average Accuracy:
0.9733333333333333
AdaBoost Average Accuracy:
0.9466666666666667
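
One caveat about the comparison above: with `shuffle=True` and no `random_state`, each call to `skf.split` draws a new partition, so the two loops score the two models on different folds. A minimal sketch of a paired evaluation follows, under stated assumptions: the data is synthetic (`make_classification`) so the block runs without `iris.csv`, `paired_cv_accuracy` is a hypothetical helper name, and the second model is only a stand-in for where the boosted ensemble would go.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold


def paired_cv_accuracy(model_factories, X, y, n_splits=10, seed=0):
    """Score every model on the same stratified folds.

    model_factories maps a name to a zero-argument callable returning a
    fresh, unfitted estimator with fit/predict (hypothetical helper).
    """
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = {name: [] for name in model_factories}
    for train_index, test_index in skf.split(X, y):
        # every model sees exactly the same train/test split
        for name, make_model in model_factories.items():
            model = make_model()
            model.fit(X[train_index], y[train_index])
            y_pred = model.predict(X[test_index])
            scores[name].append(accuracy_score(y[test_index], y_pred))
    return {name: np.array(vals) for name, vals in scores.items()}


# Synthetic stand-in data (assumption: not the Iris file used above).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_clusters_per_class=1, random_state=0
)

factories = {
    "logreg": lambda: LogisticRegression(solver="liblinear"),
    # stand-in for a second model; a boosted ensemble would go here
    "logreg_C0.01": lambda: LogisticRegression(solver="liblinear", C=0.01),
}
fold_scores = paired_cv_accuracy(factories, X_demo, y_demo)
for name, vals in fold_scores.items():
    print(name, "mean accuracy:", vals.mean())

# Fold-by-fold differences are meaningful because the folds are shared.
print("per-fold difference:", fold_scores["logreg"] - fold_scores["logreg_C0.01"])
```

Fixing `random_state` also makes the averages reproducible from run to run, which helps when the two models differ by only a few samples.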


In [3]:
##Testing for the Banknote Dataset

data = pd.read_csv('data_banknote_authentication.txt',sep=',').values

X = data[:,0:3]
y = data[:,4]

def val(s):
    if s==1:
        return 1
    return -1
y=np.array([val(x) for x in y])

skf = StratifiedKFold(n_splits=10, shuffle=True)
logAcc = 0
adaAcc = 0
for train_index, test_index in skf.split(X, y):
    X_train=X[train_index]
    X_test=X[test_index]
    y_train=y[train_index]
    y_test=y[test_index]
    logisticRegr = LogisticRegression(solver='liblinear')
    logisticRegr.fit(X_train, y_train)
    results =  logisticRegr.predict(X_test)
    #print(accuracy_score(y_test, results))
    logAcc += accuracy_score(y_test, results)
for train_index, test_index in skf.split(X, y):
    X_train=X[train_index]
    X_test=X[test_index]
    y_train=y[train_index]
    y_test=y[test_index]
    M,S = ADABoost(X_train,y_train,1000)
    acc = ADABoostPredict(M,S,X_test,y_test)
    #print(acc)
    adaAcc += acc
print("Logistic Regression Average Accuracy:")
print(logAcc/10)
print("AdaBoost Average Accuracy:")
print(adaAcc/10)

Logistic Regression Average Accuracy:
0.98978631122395
AdaBoost Average Accuracy:
0.9832328361366761
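
The loading step is worth a second look. `pd.read_csv` treats the first line as a header unless `header=None` is passed, so a file without a header row loses its first sample. The sketch below assumes that `data_banknote_authentication.txt` is the UCI file, which has no header and four feature columns followed by the class; under that assumption `data[:,0:3]` also leaves out the fourth feature. The `y_kmeans` output further down has 149 entries where Iris normally has 150 samples, which suggests the same effect on `iris.csv`. The rows here are made up to keep the example self-contained.

```python
import io

import pandas as pd

# Made-up rows in the assumed layout: four features, then the class label.
raw = (
    "3.6,8.7,-2.8,-0.4,0\n"
    "4.5,8.2,-2.5,-1.5,0\n"
    "-1.4,-4.9,6.5,0.3,1\n"
    "-2.5,-1.7,3.9,-0.7,1\n"
)

# Default: the first data row is consumed as column names.
with_default = pd.read_csv(io.StringIO(raw), sep=',').values
# header=None keeps every row as data.
with_header_none = pd.read_csv(io.StringIO(raw), sep=',', header=None).values

print(with_default.shape)      # (3, 5)
print(with_header_none.shape)  # (4, 5)

X_all = with_header_none[:, 0:4]  # all four feature columns
y_all = with_header_none[:, 4]    # class label in the last column
print(X_all.shape, y_all.shape)   # (4, 4) (4,)
```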


I found it difficult to get AdaBoost to consistently outperform the logistic regression classifier on its own. Most of the time the two were close, with AdaBoost slightly behind; in only one run did AdaBoost reach a higher accuracy. This may be the expected outcome: boosting helps most when the base learner is weak, and a single logistic regression already fits these datasets well, so reweighting the samples leaves little room to improve. I tried more than one dataset and saw similar results.
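
One way to probe this is to log the weighted error and the vote of each boosting round. The sketch below is a variant of the boosting loop, not the code that produced the numbers above: it runs on synthetic data from `make_classification`, uses the hypothetical helper names `boost_with_trace` and `boosted_predict`, and stops early when the weighted error reaches 0 or 0.5, which is an assumed guard. If the votes shrink toward zero after the first few rounds, the later models add little to the ensemble.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression


def boost_with_trace(X, y, T):
    """AdaBoost over logistic regressions, recording (error, alpha) per round.

    y must be in {-1, +1}. Stops early if a round is perfect or no better
    than chance under the current weights (assumed guard, see lead-in).
    """
    weights = np.full(len(X), 1 / len(X))
    models, alphas, trace = [], [], []
    for t in range(T):
        model = LogisticRegression(solver="liblinear")
        model.fit(X, y, sample_weight=weights)
        pred = model.predict(X)
        error = weights[pred != y].sum()
        if error <= 0 or error >= 0.5:
            break  # alpha would be infinite or non-positive
        alpha = 0.5 * np.log((1 - error) / error)
        models.append(model)
        alphas.append(alpha)
        trace.append((error, alpha))
        # up-weight mistakes, down-weight correct predictions, renormalize
        weights = weights * np.exp(-alpha * y * pred)
        weights = weights / weights.sum()
    return models, np.array(alphas), trace


def boosted_predict(models, alphas, X):
    """Sign of the alpha-weighted vote; returns labels in {-1, +1}."""
    votes = sum(a * m.predict(X) for m, a in zip(models, alphas))
    return np.where(votes >= 0, 1, -1)


# Synthetic stand-in data with some label noise (assumption, not Iris).
X_demo, y01 = make_classification(
    n_samples=300, n_features=4, n_clusters_per_class=1,
    flip_y=0.05, random_state=0,
)
y_demo = np.where(y01 == 1, 1, -1)

models, alphas, trace = boost_with_trace(X_demo, y_demo, T=20)
for t, (error, alpha) in enumerate(trace):
    print(f"round {t}: weighted error = {error:.3f}, alpha = {alpha:.3f}")

train_pred = boosted_predict(models, alphas, X_demo)
print("training accuracy:", np.mean(train_pred == y_demo))
```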

## Question 2 (10 pts)
Use DBSCAN (try different parameters) and K-means (K=3) on the Iris dataset and discuss/compare the results with the Iris ground truth.
Please explain why K-fold cross-validation is not required for the comparison among these algorithms.

Comparing DBSCAN and K-means against the ground truth, K-means gives the better accuracy (about 0.89), even though the DBSCAN runs were scored against an easier binary version of the labels (virginica vs. the rest), where they reached 0.59 to 0.66. This most likely comes from how DBSCAN works: it grows clusters from points that have at least min_samples neighbors within distance eps, and it labels every point that no cluster reaches as noise (-1). Since DBSCAN breaks the dataset down into two clusters plus noise, it is difficult to pick an eps that absorbs those outliers without also merging the clusters. If eps is set high enough, DBSCAN eventually returns a single cluster, which defeats the purpose of clustering in the first place. One caveat: cluster ids are arbitrary, so accuracy_score is only meaningful when the ids happen to line up with the label encoding, as they do in the K-means run shown here.
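
Because cluster ids are arbitrary, a score that does not depend on how the clusters are numbered is easier to compare across algorithms. The sketch below uses scikit-learn's bundled copy of Iris (`load_iris`) rather than `iris.csv`, scores K-means with `adjusted_rand_score`, and also computes an accuracy after matching clusters to classes with SciPy's `linear_sum_assignment`. The helper name `matched_accuracy` is made up for this example.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, adjusted_rand_score


def matched_accuracy(y_true, y_pred):
    """Accuracy under the best one-to-one mapping of cluster ids to classes."""
    true_ids, t = np.unique(y_true, return_inverse=True)
    pred_ids, p = np.unique(y_pred, return_inverse=True)
    # table[i, j] = number of samples in cluster i that belong to class j
    table = np.zeros((len(pred_ids), len(true_ids)), dtype=int)
    np.add.at(table, (p, t), 1)
    # linear_sum_assignment minimizes cost, so negate to maximize agreement
    rows, cols = linear_sum_assignment(-table)
    return table[rows, cols].sum() / len(y_true)


X_iris, y_iris = load_iris(return_X_y=True)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_iris)

raw_acc = accuracy_score(y_iris, labels)     # depends on cluster numbering
best_acc = matched_accuracy(y_iris, labels)  # invariant to renumbering
ari = adjusted_rand_score(y_iris, labels)    # invariant to renumbering
print("raw accuracy:", raw_acc)
print("matched accuracy:", best_acc)
print("adjusted Rand index:", ari)
```

The matching treats DBSCAN's noise label (-1) like any other cluster id; samples in clusters left unmatched count as errors.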

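K-fold cross-validation is not required for this comparison because clustering is unsupervised: neither algorithm sees the labels while fitting, so the ground truth is used only afterwards, to score the result. There is no held-out prediction task and no risk of overfitting to the labels, so evaluating on the same data that was clustered is appropriate.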
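
A quick way to see that the labels play no part in fitting: in scikit-learn, the `y` argument of `KMeans.fit` and `DBSCAN.fit` is ignored, so passing shuffled labels leaves the clustering unchanged. The sketch uses the bundled `load_iris` data and a fixed `random_state` so the two K-means runs are comparable; the parameter values are illustrative.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import load_iris

X_iris, y_iris = load_iris(return_X_y=True)
y_shuffled = np.random.default_rng(0).permutation(y_iris)

# Fit each algorithm twice: once with the true labels, once with shuffled ones.
km_true = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_iris, y_iris)
km_shuf = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_iris, y_shuffled)
db_true = DBSCAN(eps=0.5, min_samples=5).fit(X_iris, y_iris)
db_shuf = DBSCAN(eps=0.5, min_samples=5).fit(X_iris, y_shuffled)

# The cluster assignments do not change, because y is never read.
print(np.array_equal(km_true.labels_, km_shuf.labels_))  # True
print(np.array_equal(db_true.labels_, db_shuf.labels_))  # True
```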

References:

https://www.dummies.com/programming/big-data/data-science/how-to-create-an-unsupervised-learning-model-with-dbscan/
https://heartbeat.fritz.ai/k-means-clustering-using-sklearn-and-python-4a054d67b187

In [4]:
data = pd.read_csv('iris.csv',sep=',').values
X=data[:,0:4]
y=data[:,4]

def val(s):
    if s=='Iris-virginica':
        return 1
    return 0
y=np.array([val(x) for x in y])


In [5]:
dbscan = DBSCAN()
dbscan.fit(X)
dbscan.labels_
accuracy_score(y, dbscan.labels_) ##default

0.5906040268456376

In [6]:
dbscan = DBSCAN(eps=1, min_samples=15)
dbscan.fit(X)
dbscan.labels_
accuracy_score(y, dbscan.labels_)

0.6510067114093959

In [7]:
dbscan = DBSCAN(eps=3, min_samples=2)
dbscan.fit(X)
dbscan.labels_
accuracy_score(y, dbscan.labels_)

0.6644295302013423
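
To try parameters more systematically, a small grid over `eps` and `min_samples` can report how many clusters and noise points each setting yields. The grid values below are arbitrary choices for illustration, and the sketch uses `load_iris` instead of `iris.csv`. Each run is scored with the adjusted Rand index against the three species, which does not depend on how the clusters are numbered.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score

X_iris, y_iris = load_iris(return_X_y=True)

results = []
for eps in (0.3, 0.5, 1.0, 3.0):       # illustrative grid values
    for min_samples in (2, 5, 15):
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X_iris)
        # the noise label -1 is not a cluster
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        n_noise = int(np.sum(labels == -1))
        ari = adjusted_rand_score(y_iris, labels)
        results.append((eps, min_samples, n_clusters, n_noise, ari))

print("eps  min_samples  clusters  noise  ARI")
for eps, min_samples, n_clusters, n_noise, ari in results:
    print(f"{eps:<4} {min_samples:<12} {n_clusters:<9} {n_noise:<6} {ari:.3f}")
```

Reading the cluster and noise counts next to the score shows whether a setting fragments the data, merges it into one cluster, or discards many points as noise.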

In [8]:

data = pd.read_csv('iris.csv',sep=',').values
X=data[:,0:4]
y=data[:,4]

def val(s):
    if s=='Iris-setosa':
        return 1
    elif s =='Iris-versicolor':
        return 0
    else:
        return 2
    
y=np.array([val(x) for x in y])

kmeans = KMeans(n_clusters=3)
y_kmeans = kmeans.fit_predict(X)
accuracy_score(y, y_kmeans)

0.8926174496644296

In [9]:
y_kmeans

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2, 2,
       2, 2, 0, 0, 2, 2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 0, 2, 2, 2, 2, 2,
       0, 2, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 0], dtype=int32)
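
A contingency table makes an array like the one above easier to read: it counts, for each species, how many samples landed in each cluster, without assuming any particular cluster numbering. This sketch uses `load_iris` and a fixed `random_state`, so its cluster ids need not match the ones printed above.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()
species = iris.target_names[iris.target]  # species name for each sample
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(iris.data)

# rows: true species, columns: K-means cluster id
table = pd.crosstab(pd.Series(species, name="species"),
                    pd.Series(clusters, name="cluster"))
print(table)
```

A species whose row is spread over two columns is one that K-means splits between clusters.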