# Exercise 1

In this exercise, we will perform classification on a simple dataset.

We can generate the dataset with:

In [4]:
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=5000, n_features=2, centers=2,
                  cluster_std=3.5, random_state=1)

X and y are the features and the labels, respectively. We can look at the format of the data:

In [5]:
print(type(X))
print(type(y))

print(X.shape)
print(y.shape)

print(X.dtype)
print(y.dtype)

<class 'numpy.ndarray'>
<class 'numpy.ndarray'>
(5000, 2)
(5000,)
float64
int64


As you can see, X and y are NumPy arrays. The dataset contains 5000 samples: X holds 2 features per sample, and y holds the corresponding class labels.

## Plotting the data

When working with a dataset, it helps to visualize the data first so that we can check whether our results are meaningful. We will use matplotlib to plot the data.

In [14]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
plt.figure()
plt.scatter(X[:, 0], X[:, 1], marker="x", c=y)
plt.show()

We can see that the two clusters are not linearly separable, but a linear classifier should still yield good results. Your tasks in this notebook are:

- Add and adapt your regression code (using numpy, not sklearn) from the previous exercise so that it can predict the class label for each sample. That is, we treat the class of a point (0 or 1) as the target value of the regression. The linear regression then yields a formula that assigns each point (x1, x2) in the feature plane a score indicating which class it might belong to. The score is not a probability, because it need not lie between 0 and 1, but we can interpret a value nearer to 0 as favoring class 0 and a value nearer to 1 as favoring class 1.
- Plot the data again, coloring each point by its predicted class (e.g. yellow for class 0, blue for class 1).
- Calculate the classification accuracy $\left( \frac{n_{\text{correctly-classified}}}{n_{\text{correctly-classified}} + n_{\text{incorrectly-classified}}} \right)$.
- Bonus: If everything works, you will see that your classification splits the dataset at a specific line between the two clusters. Calculate the formula for this line and plot it as a line using matplotlib.
- After you have done the iris classification exercise, print the [classification_report](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html) using sklearn for this binary classification task. Then plot the precision-recall curve using [precision_recall_curve](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_curve.html#sklearn.metrics.precision_recall_curve) and matplotlib.
- Bonus: After you have done all the above, write your own code to calculate the metrics shown in classification_report and use that code to generate the precision-recall curve manually.
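The first task and the first bonus can be sketched as follows. This is only an illustration under stated assumptions, not the intended solution: the data is a numpy-generated stand-in for the `make_blobs` dataset, the fit uses `np.linalg.lstsq` instead of the gradient-descent code from the previous exercise, and the names `X_demo`, `y_demo`, and `boundary_x2` are hypothetical. The score is thresholded at 0.5, the midpoint between the two class labels, and the separating line follows from setting the score equal to 0.5.

```python
import numpy as np

# Stand-in data (assumption): two Gaussian clusters generated with numpy
# instead of make_blobs, so the sketch runs without scikit-learn.
rng = np.random.default_rng(0)
n = 500
X0 = rng.normal(loc=[-2.0, -2.0], scale=1.5, size=(n, 2))  # class 0
X1 = rng.normal(loc=[2.0, 2.0], scale=1.5, size=(n, 2))    # class 1
X_demo = np.vstack([X0, X1])
y_demo = np.concatenate([np.zeros(n, dtype=int), np.ones(n, dtype=int)])

# Prepend a column of ones so the model has an intercept w0.
A = np.column_stack([np.ones(len(X_demo)), X_demo])

# Least-squares fit of score = w0 + w1*x1 + w2*x2 to the 0/1 labels.
w, *_ = np.linalg.lstsq(A, y_demo, rcond=None)

# Threshold the score at 0.5 to obtain a class label.
scores = A @ w
pred = (scores >= 0.5).astype(int)
accuracy = np.mean(pred == y_demo)
print(f"accuracy: {accuracy:.3f}")


def boundary_x2(x1):
    """x2 on the decision boundary w0 + w1*x1 + w2*x2 = 0.5 (requires w2 != 0)."""
    return (0.5 - w[0] - w[1] * x1) / w[2]
```

To draw the line, evaluate `boundary_x2` at the smallest and largest x1 values in the data and pass the two resulting points to `plt.plot`.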

## Solution

In [11]:
# imports
import numpy as np
import matplotlib.pyplot as plt

# parameters
alpha = 0.01  # learning rate
num_iters = 1000  # number of iterations

# add a column of ones so the model can learn an intercept w0
X_b = np.column_stack([np.ones(X.shape[0]), X])

# initialise the coefficients for multiple linear regression
coeffs = np.zeros(X_b.shape[1])

# gradient descent algorithm
for i in range(num_iters):
    # compute predictions and errors
    y_pred = np.dot(X_b, coeffs)
    errors = y_pred - y

    # update coefficients
    gradient = np.dot(X_b.T, errors) / X_b.shape[0]
    coeffs -= alpha * gradient

# score each sample with y = w0 + w1*x1 + w2*x2 and threshold at 0.5,
# the midpoint between the class labels 0 and 1
prediction = (np.dot(X_b, coeffs) > 0.5).astype(int)

# count the correct and incorrect predictions
acc_prediction = np.sum(prediction == y)
inacc_prediction = np.sum(prediction != y)

# print the final accuracy of the linear regression classifier
print('Accuracy', acc_prediction / (acc_prediction + inacc_prediction) * 100, '%')

# plot the data coloured by predicted class
plt.figure()
plt.scatter(X[:, 0], X[:, 1], marker="x", c=prediction)
plt.show()
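For the last bonus task, precision and recall can be computed by hand from the regression scores. The following is a minimal sketch, not a replacement for `classification_report` or `precision_recall_curve`: the helper `precision_recall_at` and the toy arrays `y_toy` and `s_toy` are made up for illustration, and the value returned when no sample is predicted positive is a convention chosen here.

```python
import numpy as np


def precision_recall_at(y_true, scores, threshold):
    """Precision and recall for class 1 when predicting 1 if score >= threshold."""
    pred = scores >= threshold
    tp = np.sum(pred & (y_true == 1))   # true positives
    fp = np.sum(pred & (y_true == 0))   # false positives
    fn = np.sum(~pred & (y_true == 1))  # false negatives
    # Convention (assumption): precision is 1.0 when nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) > 0 else 1.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall


# Toy labels and scores, made up for illustration.
y_toy = np.array([0, 0, 1, 0, 1, 1])
s_toy = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.9])

# Evaluate one (precision, recall) pair per distinct score value.
thresholds = np.unique(s_toy)
curve = [precision_recall_at(y_toy, s_toy, t) for t in thresholds]
for t, (p, r) in zip(thresholds, curve):
    print(f"threshold={t:.2f} precision={p:.2f} recall={r:.2f}")
```

Plotting the recall values against the precision values in `curve` gives a manual precision-recall curve. With the real data, the regression scores and `y` would take the place of the toy arrays.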