
# **Preparation**

Importing Libraries

In [4]:
import numpy as np
from sklearn import datasets
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=42)

Importing Dataset

In [5]:
dataset = datasets.fetch_california_housing()
# Dataset description
print(dataset.DESCR)

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block group
        - HouseAge      median house age in block group
        - AveRooms      average number of rooms per household
        - AveBedrms     average number of bedrooms per household
        - Population    block group population
        - AveOccup      average number of household members
        - Latitude      block group latitude
        - Longitude     block group longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived

In [6]:
print("Original target values:", dataset.target)

dataset.target = dataset.target.astype(int)

print("Target values after conversion:", dataset.target)
print("Input variables shape:", dataset.data.shape)
print("Output variables shape:", dataset.target.shape)

Original target values: [4.526 3.585 3.521 ... 0.923 0.847 0.894]
Target values after conversion: [4 3 3 ... 0 0 0]
Input variables shape: (20640, 8)
Output variables shape: (20640,)
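
The conversion above relies on `astype(int)`, which truncates toward zero rather than rounding, so every target below 1.0 ends up in class 0. A minimal sketch on a few of the target values printed above, with rounding shown only for comparison:

```python
import numpy as np

# A few target values copied from the printed output above
sample = np.array([4.526, 3.585, 3.521, 0.923, 0.847, 0.894])

truncated = sample.astype(int)         # drops the fractional part
rounded = np.rint(sample).astype(int)  # rounds to the nearest integer, for comparison only

print("truncated:", truncated)
print("rounded:  ", rounded)
```

The choice matters for the class boundaries: truncation groups values by their integer part, while rounding would shift every boundary by 0.5.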


Nearest Neighbours

In [7]:
def NN1(traindata, trainlabel, query):
    # Difference between each training sample and the query;
    # NumPy broadcasting applies the subtraction to every row.
    diff = traindata - query
    sq = diff * diff  # square the differences
    dist = sq.sum(1)  # add up the squares
    label = trainlabel[np.argmin(dist)]
    return label


def NN(traindata, trainlabel, testdata):
    predlabel = np.array([NN1(traindata, trainlabel, i) for i in testdata])
    return predlabel
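
The same nearest-neighbour rule can be written without a Python loop over the queries. The sketch below is an illustration on made-up toy points, not a replacement for `NN`; the names `train_x`, `train_y` and `queries` are hypothetical.

```python
import numpy as np

# Toy training set: three 2-D points with integer labels (made up for illustration)
train_x = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
train_y = np.array([0, 1, 2])
queries = np.array([[0.2, 0.1], [4.0, 4.5]])

# Pairwise squared Euclidean distances via broadcasting:
# (n_queries, 1, n_features) - (1, n_train, n_features) -> (n_queries, n_train, n_features)
diff = queries[:, None, :] - train_x[None, :, :]
dist = (diff ** 2).sum(axis=2)

# Each query takes the label of its closest training point
pred = train_y[dist.argmin(axis=1)]
print(pred)
```

This builds an array of size n_queries × n_train × n_features, so for a dataset as large as the one used here it may need to be applied in batches.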

Random Classifier

In [8]:
def RandomClassifier(traindata, trainlabel, testdata):
    classes = np.unique(trainlabel)
    rints = rng.integers(low=0, high=len(classes), size=len(testdata))
    predlabel = classes[rints]
    return predlabel
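
A classifier that guesses uniformly among K classes is expected to be right about 1/K of the time, whatever the class balance. The sketch below checks this by simulation, assuming six classes (integer targets 0 to 5); the generator `rng_demo` and the class probabilities are made up for illustration.

```python
import numpy as np

rng_demo = np.random.default_rng(seed=0)  # separate generator, hypothetical name

num_classes = 6  # assumption: integer targets 0..5
# Deliberately imbalanced ground-truth labels (probabilities are illustrative)
true_labels = rng_demo.choice(
    num_classes, size=100_000, p=[0.05, 0.35, 0.30, 0.15, 0.10, 0.05]
)
# Uniform random guesses, drawn the same way as in RandomClassifier
guesses = rng_demo.integers(low=0, high=num_classes, size=true_labels.size)

empirical = (guesses == true_labels).mean()
print("empirical accuracy:", empirical)
print("expected 1/K:      ", 1 / num_classes)
```

Under that assumption the baseline is about 16.7%, which gives a reference point for judging any other classifier on the same labels.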

Model Evaluation

In [9]:
def Accuracy(gtlabel, predlabel):
    assert len(gtlabel) == len(
        predlabel
    ), "Length of the ground-truth labels and predicted labels should be the same"
    correct = (
        gtlabel == predlabel
    ).sum()  # count the number of times the groundtruth label is equal to the predicted label.
    return correct / len(gtlabel)

Splitting Data

In [10]:
def split(data, label, percent):
    # generate a random number for each sample
    rnd = rng.random(len(label))
    split1 = rnd < percent
    split2 = rnd >= percent

    split1data = data[split1, :]
    split1label = label[split1]
    split2data = data[split2, :]
    split2label = label[split2]
    return split1data, split1label, split2data, split2label
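
Because `split` draws an independent random number per sample, the two parts only approximately match the requested fraction. If exact sizes are wanted, one possible alternative is to shuffle the indices and cut them at a fixed position. The function `split_exact` below is a hypothetical variant, not part of the tutorial.

```python
import numpy as np


def split_exact(data, label, percent, rng):
    """Hypothetical variant of split(): the first part gets exactly round(percent * n) samples."""
    n = len(label)
    order = rng.permutation(n)      # random ordering of all sample indices
    cut = int(round(percent * n))   # size of the first part
    first, second = order[:cut], order[cut:]
    return data[first], label[first], data[second], label[second]


rng_demo = np.random.default_rng(seed=0)
data = np.arange(40, dtype=float).reshape(20, 2)
label = np.arange(20)

d1, l1, d2, l2 = split_exact(data, label, 0.25, rng_demo)
print(len(l1), len(l2))
```

Both approaches give random, non-overlapping parts; the mask-based `split` is simpler, while the permutation-based version fixes the sizes.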

# **Model Application**

In [11]:
testdata, testlabel, alltraindata, alltrainlabel = split(
    dataset.data, dataset.target, 20 / 100
)
print("Number of test samples:", len(testlabel))
print("Number of train samples:", len(alltrainlabel))
print("Percent of test data:", len(testlabel) * 100 / len(dataset.target), "%")

Number of test samples: 4144
Number of train samples: 16496
Percent of test data: 20.07751937984496 %


In [12]:
traindata, trainlabel, valdata, vallabel = split(
    alltraindata, alltrainlabel, 75 / 100)

In [13]:
trainpred = NN(traindata, trainlabel, traindata)
trainAccuracy = Accuracy(trainlabel, trainpred)
print("Training accuracy using nearest neighbour algorithm:", trainAccuracy*100, "%")

trainpred = RandomClassifier(traindata, trainlabel, traindata)
trainAccuracy = Accuracy(trainlabel, trainpred)
print("Training accuracy using random classifier: ", trainAccuracy*100, "%")

Training accuracy using nearest neighbour algorithm: 100.0 %
Training accuracy using random classifier:  16.4375808538163 %


In [14]:
valpred = NN(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using nearest neighbour algorithm:", valAccuracy*100, "%")


valpred = RandomClassifier(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using random classifier:", valAccuracy*100, "%")

Validation accuracy using nearest neighbour algorithm: 34.10852713178294 %
Validation accuracy using random classifier: 16.884689922480618 %


In [15]:
traindata, trainlabel, valdata, vallabel = split(
    alltraindata, alltrainlabel, 75 / 100)
valpred = NN(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using nearest neighbour algorithm:", valAccuracy*100, "%")

Validation accuracy using nearest neighbour algorithm: 34.048257372654156 %


In [16]:
testpred = NN(alltraindata, alltrainlabel, testdata)
testAccuracy = Accuracy(testlabel, testpred)

print("Test accuracy:", testAccuracy*100, "%")

Test accuracy: 34.91795366795367 %


# **Questions Set 1**

**1. How is the accuracy of the validation set affected if we increase the percentage of validation set? What happens when we reduce it?**

In [17]:
traindata, trainlabel, valdata, vallabel = split(
    alltraindata, alltrainlabel, 30 / 100)
valpred = NN(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using nearest neighbour algorithm:", valAccuracy*100, "%")

Validation accuracy using nearest neighbour algorithm: 31.528113142462917 %


In [18]:
traindata, trainlabel, valdata, vallabel = split(
    alltraindata, alltrainlabel, 70 / 100)
valpred = NN(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using nearest neighbour algorithm:", valAccuracy*100, "%")

Validation accuracy using nearest neighbour algorithm: 33.066502463054185 %


In [19]:
traindata, trainlabel, valdata, vallabel = split(
    alltraindata, alltrainlabel, 99.9 / 100)
valpred = NN(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using nearest neighbour algorithm:", valAccuracy*100, "%")

Validation accuracy using nearest neighbour algorithm: 33.33333333333333 %


In [20]:
traindata, trainlabel, valdata, vallabel = split(
    alltraindata, alltrainlabel, 0.1 / 100)
valpred = NN(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using nearest neighbour algorithm:", valAccuracy*100, "%")

Validation accuracy using nearest neighbour algorithm: 32.073755079759806 %


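Note that the `percent` argument of `split` is the fraction assigned to the training set, so the four runs above leave roughly 70%, 30%, 0.1% and 99.9% of the data for validation. Across these runs the validation accuracy stays within a narrow band (about 31.5% to 33.3%). It is slightly higher when less data is held out for validation (33.07% with about 30% held out versus 31.53% with about 70%), most likely because more samples remain for training. When the validation set is very small (about 0.1%), the accuracy is computed from only a handful of samples, so the estimate is unreliable.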

**2. How does the size of the train and validation set affect how well we can predict the accuracy on the test set using the validation set?**

In [21]:
traindata, trainlabel, valdata, vallabel = split(
    alltraindata, alltrainlabel, 30 / 100)
valpred = NN(traindata, trainlabel, valdata)
testpred = NN(alltraindata, alltrainlabel, testdata)
testAccuracy = Accuracy(testlabel, testpred)

print("Test accuracy:", testAccuracy*100, "%")

Test accuracy: 34.91795366795367 %


In [22]:
traindata, trainlabel, valdata, vallabel = split(
    alltraindata, alltrainlabel, 70 / 100)
valpred = NN(traindata, trainlabel, valdata)
testpred = NN(alltraindata, alltrainlabel, testdata)
testAccuracy = Accuracy(testlabel, testpred)

print("Test accuracy:", testAccuracy*100, "%")

Test accuracy: 34.91795366795367 %


In [23]:
traindata, trainlabel, valdata, vallabel = split(
    alltraindata, alltrainlabel, 99.9 / 100)
valpred = NN(traindata, trainlabel, valdata)
testpred = NN(alltraindata, alltrainlabel, testdata)
testAccuracy = Accuracy(testlabel, testpred)

print("Test accuracy:", testAccuracy*100, "%")

Test accuracy: 34.91795366795367 %


In [24]:
traindata, trainlabel, valdata, vallabel = split(
    alltraindata, alltrainlabel, 0.1 / 100)
valpred = NN(traindata, trainlabel, valdata)
testpred = NN(alltraindata, alltrainlabel, testdata)
testAccuracy = Accuracy(testlabel, testpred)

print("Test accuracy:", testAccuracy*100, "%")

Test accuracy: 34.91795366795367 %


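The test accuracy does not change with the size of the validation set. This is expected here: `testpred` is always computed from `alltraindata` and `alltrainlabel`, so the train/validation split never enters the test prediction. To answer the question, the validation accuracy of each split has to be compared with this fixed test accuracy: a small training set tends to make the validation accuracy underestimate the test accuracy, while a small validation set makes the estimate noisy.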
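
One way to probe this question is to compute the validation accuracy for several training fractions and compare each with a fixed test accuracy. The sketch below does this on synthetic data so that it runs on its own; the data, the helper `nn_predict` and the chosen fractions are assumptions for illustration, not results from this notebook.

```python
import numpy as np

rng_demo = np.random.default_rng(seed=0)


def nn_predict(train_x, train_y, test_x):
    # 1-nearest-neighbour prediction using squared Euclidean distance
    diff = test_x[:, None, :] - train_x[None, :, :]
    return train_y[(diff ** 2).sum(axis=2).argmin(axis=1)]


# Synthetic stand-in data: three overlapping Gaussian clusters in 2-D
n = 1200
y = rng_demo.integers(0, 3, size=n)
x = rng_demo.normal(size=(n, 2)) + 2.0 * y[:, None]

# Hold out a fixed test set; the rest is the pool for train/validation splits
test_x, test_y = x[:300], y[:300]
pool_x, pool_y = x[300:], y[300:]
test_acc = (nn_predict(pool_x, pool_y, test_x) == test_y).mean()

gaps = {}
for train_frac in (0.1, 0.5, 0.9):
    mask = rng_demo.random(len(pool_y)) < train_frac  # True -> training sample
    val_pred = nn_predict(pool_x[mask], pool_y[mask], pool_x[~mask])
    val_acc = (val_pred == pool_y[~mask]).mean()
    gaps[train_frac] = abs(val_acc - test_acc)
    print(f"train fraction {train_frac:.1f}: val {val_acc:.3f}, test {test_acc:.3f}")
```

Repeating the loop with different seeds gives a sense of how much of each gap is due to the split size and how much is random variation.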

**3. What do you think is a good percentage to reserve for the validation set so that these two factors are balanced?**

One can allocate 30 percent for the validation dataset in order to balance the two factors.

# **Exercise**

In [25]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score


In [26]:
testdata, testlabel, alltraindata, alltrainlabel = split(
    dataset.data, dataset.target, 20 / 100
)

In [30]:
# Implementing 1-Nearest Neighbor classifier
knn_1 = KNeighborsClassifier(n_neighbors=1)
knn_1.fit(alltraindata, alltrainlabel)

y_pred_1 = knn_1.predict(testdata)

# Calculating the accuracy for the 1-NN classifier
accuracy_1 = accuracy_score(testlabel,y_pred_1)
print(f"Accuracy of 1-Nearest Neighbor classifier: {accuracy_1:.4f}")

# Implementing 3-Nearest Neighbor classifier
knn_3 = KNeighborsClassifier(n_neighbors=3)
knn_3.fit(alltraindata, alltrainlabel)

# Predicting with the 3-NN classifier
y_pred_3 = knn_3.predict(testdata)

# Calculating the accuracy for the 3-NN classifier
accuracy_3 = accuracy_score(testlabel,y_pred_3)
print(f"Accuracy of 3-Nearest Neighbor classifier: {accuracy_3:.4f}")

# Comparing results
if accuracy_1 > accuracy_3:
    print("1-Nearest Neighbor is more accurate.")
elif accuracy_1 < accuracy_3:
    print("3-Nearest Neighbor is more accurate.")
else:
    print("Both classifiers have the same accuracy.")

Accuracy of 1-Nearest Neighbor classifier: 0.3498
Accuracy of 3-Nearest Neighbor classifier: 0.3550
3-Nearest Neighbor is more accurate.
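
To show what changes between 1-NN and 3-NN, the sketch below implements k-nearest-neighbour prediction with a plain majority vote in NumPy. It is an illustration on made-up 1-D points, not the scikit-learn implementation; `knn_predict` is a hypothetical helper, and ties here go to the smallest label.

```python
import numpy as np


def knn_predict(train_x, train_y, test_x, k):
    """Sketch of k-NN with majority vote; ties resolve to the smallest label."""
    diff = test_x[:, None, :] - train_x[None, :, :]
    dist = (diff ** 2).sum(axis=2)
    nearest = np.argsort(dist, axis=1)[:, :k]  # indices of the k closest training points
    votes = train_y[nearest]                   # labels of those points, shape (n_test, k)
    return np.array([np.bincount(row).argmax() for row in votes])


# Made-up 1-D example where the closest point disagrees with its neighbours
train_x = np.array([[0.0], [0.1], [0.2], [1.0], [1.1]])
train_y = np.array([0, 0, 1, 1, 1])
test_x = np.array([[0.16]])

pred_1 = knn_predict(train_x, train_y, test_x, k=1)
pred_3 = knn_predict(train_x, train_y, test_x, k=3)
print(pred_1, pred_3)
```

With k=1 the query copies the label of its single closest point, while with k=3 the two other neighbours outvote it. This is how a larger k smooths out isolated labels.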


# **Multiple Splits**

In [32]:
def AverageAccuracy(alldata, alllabel, splitpercent, iterations, classifier=NN):
    accuracy = 0
    for ii in range(iterations):
        traindata, trainlabel, valdata, vallabel = split(
            alldata, alllabel, splitpercent
        )
        valpred = classifier(traindata, trainlabel, valdata)
        accuracy += Accuracy(vallabel, valpred)
    return accuracy / iterations

In [33]:
avg_acc = AverageAccuracy(alltraindata, alltrainlabel, 75 / 100, 10, classifier=NN)
print("Average validation accuracy:", avg_acc*100, "%")
testpred = NN(alltraindata, alltrainlabel, testdata)

print("Test accuracy:", Accuracy(testlabel, testpred)*100, "%")

Average validation accuracy: 34.32860224772855 %
Test accuracy: 34.97901752653666 %
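
`AverageAccuracy` returns only the mean, which hides how much the individual splits disagree. A small helper that also reports the spread could look like the sketch below; `accuracy_mean_std` is a hypothetical name, and the accuracies in `demo` are made-up values, not results from this notebook.

```python
import numpy as np


def accuracy_mean_std(accuracies):
    """Hypothetical helper: mean, sample std and standard error of per-split accuracies."""
    acc = np.asarray(accuracies, dtype=float)
    mean = acc.mean()
    std = acc.std(ddof=1)          # sample standard deviation across splits
    sem = std / np.sqrt(len(acc))  # standard error of the mean
    return mean, std, sem


# Made-up per-split accuracies, for illustration only
demo = [0.34, 0.35, 0.33, 0.36, 0.34]
mean, std, sem = accuracy_mean_std(demo)
print(f"mean={mean:.4f} std={std:.4f} sem={sem:.4f}")
```

Collecting the per-split accuracies in a list inside the loop, instead of summing them, would be enough to feed such a helper.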


# **Questions Set 2**

**1. Does averaging the validation accuracy across multiple splits give more consistent results?**

Yes, averaging validation accuracy across multiple splits typically provides more consistent and reliable results.

**2. Does it give a more accurate estimate of test accuracy?**

Yes. Averaging validation accuracy across multiple splits, as in cross-validation, generally gives a more accurate and reliable estimate of the test accuracy. In the run above, the average validation accuracy (34.33%) is close to the test accuracy (34.98%).

**3. What is the effect of the number of iterations on the estimate? Do we get a better estimate with higher iterations?**

With the random-split method used here, each iteration gives roughly the same estimate, so adding iterations changes the average only slightly; it mainly reduces the variation caused by the random splits. Methods such as K-fold cross-validation can give better estimates.
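
For comparison with repeated random splits, K-fold cross-validation partitions the data once into k folds and uses each fold exactly once for validation. A minimal index generator is sketched below; `kfold_indices` is a hypothetical helper, not part of the tutorial.

```python
import numpy as np


def kfold_indices(n_samples, k, rng):
    """Yield (train_idx, val_idx) pairs; hypothetical helper, not part of the tutorial."""
    order = rng.permutation(n_samples)  # shuffle once
    folds = np.array_split(order, k)    # k nearly equal folds
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx


rng_demo = np.random.default_rng(seed=0)
splits = list(kfold_indices(10, 5, rng_demo))
for train_idx, val_idx in splits:
    print("val:", np.sort(val_idx), "train size:", len(train_idx))
```

Unlike independent random splits, every sample is validated exactly once, so no sample is over- or under-represented in the averaged accuracy.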

**4. Consider the results you got for the previous questions. Can we deal with a very small train dataset or validation dataset by increasing the iterations?**

Increasing the number of iterations can reduce the variation caused by random splitting, which partly helps with a small validation set. It cannot make up for a very small training set, because each model is still trained on too few samples, so the issue is only partly addressed.