In [1]:
import numpy as np
import pandas as pd

### Implemented a self-training system using a K-nearest neighbor classifier for this problem. To evaluate the level of certainty for each prediction on unlabeled data, I used the purity of the prediction, i.e. how many more "votes" the predicted class received than the other class, together with distance-weighted voting, where the number of votes each of the K data points receives is inversely proportional to its distance from the data point being classified. The algorithm allows changing the number of data items that are added to the labeled data set in each iteration.

In [2]:
def cartesian_distance(sample, inputs):
    """Euclidean distance from `sample` to every row of `inputs`."""
    diff = sample - inputs
    sum_pow = np.sum(np.power(diff, 2), axis=1)

    return np.power(sum_pow, 0.5)

def classify(k, sorted_labels):
    """Majority vote over the k nearest labels; returns (label, vote share)."""
    k_neighbors = sorted_labels[:k]
    men_occurrences = np.count_nonzero(k_neighbors == ' M')
    women_occurrences = np.count_nonzero(k_neighbors == ' W')
    probability = max(men_occurrences, women_occurrences)/k

    return ' M' if men_occurrences > women_occurrences else ' W', probability

def KNN_classification(sample, k, df):
    """Classify `sample` with k-NN using the labeled rows of `df`."""
    labels = df['Class'].values
    inputs = df.drop('Class', axis=1).values

    cart_distance = cartesian_distance(sample, inputs)

    # Order the labels from nearest to farthest neighbor.
    sorted_labels = labels[np.argsort(cart_distance)]

    return classify(k, sorted_labels)

def accuracy_metric(actual, predicted):
    correct = 0
    for i in range(len(actual)):
        if actual[i] == predicted[i]:
            correct += 1
    return correct / float(len(actual)) 
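
The description at the top mentions distance-weighted voting and purity, while `classify` above counts unweighted votes and reports the majority share. The following is a minimal sketch of what a weighted variant could look like, not the implementation used for the results in this notebook. The function name `weighted_vote` and the `eps` guard against zero distances are assumptions introduced for illustration.

```python
import numpy as np

def weighted_vote(distances, labels, k, eps=1e-9):
    """Distance-weighted k-NN vote (hypothetical helper).

    Returns (predicted_label, purity), where purity is the winning
    class's share of the total vote weight minus the runner-up's share.
    `eps` is an assumed guard against division by zero.
    """
    distances = np.asarray(distances, dtype=float)
    labels = np.asarray(labels)
    nearest = np.argsort(distances)[:k]          # indices of the k closest points
    weights = 1.0 / (distances[nearest] + eps)   # closer points cast larger votes
    classes = np.unique(labels[nearest])
    totals = np.array([weights[labels[nearest] == c].sum() for c in classes])
    shares = np.sort(totals / totals.sum())[::-1]  # weight shares, largest first
    runner_up = shares[1] if len(shares) > 1 else 0.0
    return classes[np.argmax(totals)], shares[0] - runner_up

label, purity = weighted_vote([1.0, 2.0, 4.0], [' M', ' W', ' W'], k=3)
print(label, round(purity, 3))
```

In this toy call the single close ' M' neighbor (weight 1) outweighs the two farther ' W' neighbors (weights 0.5 and 0.25), so the weighted vote disagrees with a plain majority count.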

### Learned a classifier using the semi-supervised learning algorithm and compared it against the K-nearest neighbor classifier learned only from the labeled data Ds, by evaluating the fraction of initially unlabeled points that each predicts correctly (for self-learning, this is the fraction of the initially unlabeled points that were assigned the correct label during the learning process). Repeated the comparison for four different settings of the number of data points added in each iteration: adding 1 data point per iteration, adding 5 points per iteration, adding all data points in the first iteration, and one additional setting of my choosing (a confidence threshold).
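
The fixed-size settings differ only in how many points move to the labeled set per round, so they can be expressed as one routine with a batch-size parameter. The sketch below is a NumPy-only illustration under that assumption. The names `knn_predict`, `self_train`, and `batch_size` are hypothetical and are not used elsewhere in this notebook. It keeps the true labels aligned with the remaining unlabeled rows by removing the same positions from both arrays in each round, which matters whenever more than one round runs.

```python
import numpy as np

def knn_predict(sample, X, y, k):
    """Majority-vote k-NN; returns (label, fraction of the k votes won)."""
    dist = np.sqrt(((X - sample) ** 2).sum(axis=1))
    nearest = y[np.argsort(dist)[:k]]
    classes, counts = np.unique(nearest, return_counts=True)
    best = np.argmax(counts)
    return classes[best], counts[best] / k

def self_train(X_lab, y_lab, X_unl, y_true, k=5, batch_size=1):
    """Self-training that moves `batch_size` points per round.

    Returns the fraction of initially unlabeled points whose assigned
    pseudo-label matches `y_true`.
    """
    X_lab, y_lab = np.array(X_lab, dtype=float), np.array(y_lab)
    X_unl, y_true = np.array(X_unl, dtype=float), np.array(y_true)
    total, correct = len(X_unl), 0
    while len(X_unl) > 0:
        results = [knn_predict(s, X_lab, y_lab, k) for s in X_unl]
        preds = np.array([r[0] for r in results])
        conf = np.array([r[1] for r in results])
        # Most confident first; a stable sort keeps ties in input order.
        chosen = np.argsort(-conf, kind="stable")[:batch_size]
        correct += int(np.sum(preds[chosen] == y_true[chosen]))
        X_lab = np.vstack([X_lab, X_unl[chosen]])
        y_lab = np.concatenate([y_lab, preds[chosen]])
        # Remove the same positions from both arrays so they stay aligned.
        X_unl = np.delete(X_unl, chosen, axis=0)
        y_true = np.delete(y_true, chosen)
    return correct / total if total else 0.0
```

Calling `self_train(..., batch_size=1)`, `batch_size=5`, or `batch_size=len(X_unl)` would correspond to the one-point, five-point, and all-at-once settings. With all points added in one round, every prediction comes from the initial labeled set, so the score coincides with plain supervised k-NN.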

### 1) Adding data points based on a constraint, i.e. probability > confidence. The probability is defined as the number of votes for the majority class divided by the total number of neighboring points.
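
As a small worked example of this constraint, assuming the values `k = 5` and `confidence = 0.8` used in the next cell and a two-class vote: the majority share can only be 3/5, 4/5, or 5/5, and the strict `>` comparison lets only the unanimous case through.

```python
k = 5
confidence = 0.8

# Possible values of max(votes_M, votes_W) / k in a two-class vote with k = 5:
# the winning class holds 3, 4, or 5 of the 5 votes.
possible = [winner_votes / k for winner_votes in range(k // 2 + 1, k + 1)]
accepted = [p for p in possible if p > confidence]

print(possible)   # [0.6, 0.8, 1.0]
print(accepted)   # [1.0]
```

Using `>=` instead of `>` would also admit points with 4 of 5 matching neighbors; which comparison is preferable depends on how strict the selection should be.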

In [3]:
df = pd.read_csv('data2c.csv')
labeled_data = df.iloc[:20,:]
unlabeled_data = df.iloc[20:,:-1].reset_index(drop = True)
actual_labels = df.iloc[20:,-1].values

k = 5
confidence = 0.8

In [4]:
y_actual = []

while True:

    probabilities = []
    predictions = []
    indexes = []

    for sample in unlabeled_data.values:
        prediction_1, probability = KNN_classification(sample, k, labeled_data)
        probabilities.append(probability)
        predictions.append(prediction_1)

    indexes = np.where(np.array(probabilities) > confidence) #indexes where probabilities is greater than confidence
    
    if len(indexes[0]) == 0:
        break
        
    for x in indexes[0]:
        y_actual.append(actual_labels[x]) # append the actual labels at the indexes where probability is greater than confidence

    pred = [predictions[i] for i in indexes[0]]  # predictions at the indexes where probability > confidence

    new_data = unlabeled_data.iloc[indexes[0],:].copy() # new data points to move from unlabeled data to training data, selected by the fetched indexes
    new_data['Class'] = pred

    labeled_data = pd.concat([labeled_data, new_data]).reset_index(drop = True) #updated training
    unlabeled_data = unlabeled_data.drop(indexes[0], axis = 0).reset_index(drop = True) #updated unlabeled

y_pred = labeled_data.iloc[20:,-1].values

print(f"Accuracy of the semi-supervised learning algorithm by adding data based on the constraint considering probabilities higher than the confidence provided: {accuracy_metric(y_actual, y_pred)}")

Accuracy of the semi-supervised learning algorithm by adding data based on the constraint considering probabilities higher than the confidence provided: 0.7560975609756098


### 2) adding 1 data point in each iteration

In [5]:
labeled_data = df.iloc[:20,:]
unlabeled_data = df.iloc[20:,:-1].reset_index(drop = True)
actual_labels = df.iloc[20:,-1].values

k = 5

In [6]:
y_actual = []

while True:

    probabilities = []
    predictions = []
    for sample in unlabeled_data.values:
        prediction_1, probability = KNN_classification(sample, k, labeled_data)
        probabilities.append(probability)
        predictions.append(prediction_1)

    sort_index = np.flip(np.argsort(probabilities))

    if len(sort_index)==0:
        break
    
    y_actual.append(actual_labels[sort_index[0]])
    pred = [predictions[i] for i in sort_index]

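    new_data = unlabeled_data.iloc[[sort_index[0]],:].copy() # one-row DataFrame for the most confident point
    new_data['Class'] = pred[0]
    labeled_data = pd.concat([labeled_data, new_data], ignore_index = True) # DataFrame.append was removed in pandas 2.0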
    unlabeled_data = unlabeled_data.drop(sort_index[0], axis = 0).reset_index(drop = True)
    
y_pred = labeled_data.iloc[20:,-1].values

print(f" Accuracy of the semi-supervised learning algorithm by adding 1 data point in each iteration: {accuracy_metric(y_actual, y_pred)}")

 Accuracy of the semi-supervised learning algorithm by adding 1 data point in each iteration: 0.63


### 3) adding 5 points in each iteration

In [7]:
labeled_data = df.iloc[:20,:]
unlabeled_data = df.iloc[20:,:-1].reset_index(drop = True)
actual_labels = df.iloc[20:,-1].values

k = 5

In [8]:
y_actual = []

while True:
    
    probabilities = []
    predictions = []
    for sample in unlabeled_data.values:
        prediction_1, probability = KNN_classification(sample, k, labeled_data)
        probabilities.append(probability)
        predictions.append(prediction_1)

    sort_index = np.flip(np.argsort(probabilities))
    
    if len(sort_index)==0:
        break
    
    
    for x in sort_index[:5]:
        y_actual.append(actual_labels[x])

    pred = [predictions[i] for i in sort_index]

    new_data = unlabeled_data.iloc[sort_index[:5],:].copy()
    new_data['Class'] = pred[:5]
    labeled_data = pd.concat([labeled_data, new_data], ignore_index = True) # DataFrame.append was removed in pandas 2.0
    unlabeled_data = unlabeled_data.drop(sort_index[:5], axis = 0).reset_index(drop = True)
    
y_pred = labeled_data.iloc[20:,-1].values

print(f" Accuracy of the semi-supervised learning algorithm by adding 5 data points in each iteration: {accuracy_metric(y_actual, y_pred)}")

 Accuracy of the semi-supervised learning algorithm by adding 5 data points in each iteration: 0.52


### 4) adding all data points in the first iteration

In [9]:
labeled_data = df.iloc[:20,:]
unlabeled_data = df.iloc[20:,:-1].reset_index(drop = True)
actual_labels = df.iloc[20:,-1].values

k = 5

In [10]:
y_actual = []

while True:
    
    probabilities = []
    predictions = []
    for sample in unlabeled_data.values:
        prediction_1, probability = KNN_classification(sample, k, labeled_data)
        probabilities.append(probability)
        predictions.append(prediction_1)

    sort_index = np.flip(np.argsort(probabilities))
    
    if len(sort_index)==0:
        break
    
    for x in sort_index[:100]:
        y_actual.append(actual_labels[x])
    
    pred = [predictions[i] for i in sort_index]

    new_data = unlabeled_data.iloc[sort_index[:100],:].copy()
    new_data['Class'] = pred[:100]
    labeled_data = pd.concat([labeled_data, new_data], ignore_index = True) # DataFrame.append was removed in pandas 2.0
    unlabeled_data = unlabeled_data.drop(sort_index[:100], axis = 0).reset_index(drop = True)
    
y_pred = labeled_data.iloc[20:,-1].values

print(f" Accuracy of the semi-supervised learning algorithm by adding all data points in each iteration: {accuracy_metric(y_actual, y_pred)}")

 Accuracy of the semi-supervised learning algorithm by adding all data points in each iteration: 0.77


### Using KNN Classifier (Supervised Learning) 

In [11]:
df = pd.read_csv('data2c.csv')
labeled_data = df.iloc[:20,:]
unlabeled_data = df.iloc[20:,:-1].reset_index(drop = True)
actual_labels = df.iloc[20:,-1].values

k = 5

In [12]:
predictions = []
for sample in unlabeled_data.values:
    prediction_1, probabilities = KNN_classification(sample, k, labeled_data)
    predictions.append(prediction_1)

In [13]:
accuracy_metric(actual_labels, predictions)

0.77

### Discuss if you see any performance difference and if so, what it is.

Accuracy of the semi-supervised learning algorithm by adding data based on the constraint considering probabilities higher than the confidence provided: 0.7560975609756098 <br>
Accuracy of the semi-supervised learning algorithm by adding 1 data point in each iteration: 0.63 <br>
Accuracy of the semi-supervised learning algorithm by adding 5 data points in each iteration: 0.52 <br>
Accuracy of the semi-supervised learning algorithm by adding all data points in each iteration: 0.77 <br>

From the above results, using k=5 neighbors and four different methods of adding data points, it can be inferred that adding data points based on the constraint, i.e. adding only those whose class probability is higher than the confidence, gave better results (0.756) than adding one data point (0.63) or five data points (0.52) per iteration. This is because the points added to the training data were those predicted with high confidence, so their assigned labels were more likely to match the actual labels. In this setting there is no predetermined number of data points to add per iteration, so each iteration can move more confidently labeled points into the training data, and hence it performs better than adding one or five data points per iteration. Moreover, adding all the data points in a single iteration (0.77) gives the same result as the KNN classifier used in supervised learning (0.77), because every prediction is then made from the initially labeled data only.