# DIT821 Software Engineering for AI systems - exam 2020-08-17


### Assignment: Binary Classification


* Name, e-mail:

### Assignment description

The `Breast Cancer Wisconsin` dataset from the University of Wisconsin has been loaded for you in this Notebook from the `sklearn.datasets` module. This dataset can be used to predict malignancy in breast cancer and contains 569 examples in two classes (malignant and benign), given in the target 'Diagnosis' column. There are 30 features that describe characteristics of the cell nuclei present in the scanned images and are useful for discriminating benign from malignant lumps.

You are required to implement a **Regularized Logistic Regression** classifier. The classifier is used to predict, based on the data, whether a new patient's lump is malignant or benign.

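In the original data file, the first column corresponds to the patient ID, while the diagnosis column records the outcome (“Benign” or “Malignant”, based on the type of diagnosis reported). The `X` and `y` arrays loaded from `sklearn.datasets` do not include the ID: `X` holds only the 30 features and `y` holds the diagnosis encoded as 0 or 1.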

Brief overview of dataset:
- Class (Diagnosis), as encoded by `sklearn.datasets.load_breast_cancer`:
    - malignant: 0
    - benign: 1
- Features:
    - ID number (in the original data file only; it is not one of the 30 loaded features)
    - radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, fractal dimension
    - for each of these 10 measurements, three values are given: the mean, the standard error, and the mean of the three largest values (3 × 10 = 30 features)
      

The next cells import the required libraries and load the data. You are not required to make any changes.

In [97]:
# used for manipulating directory paths
import os

# Scientific and vector computation for python
import numpy as np

# Plotting library
import matplotlib.pyplot as pyplot

# Optimization module in scipy
from scipy import optimize

# library written for this exercise providing additional functions
import utils

from sklearn.model_selection import train_test_split

# tells matplotlib to embed plots within the notebook
%matplotlib inline

In [98]:
# Load breast cancer data from sklearn.datasets
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y = True)

# We create a small test set (x_test, y_test) as a subset from dataset to use later for prediction
X, x_test, y, y_test = train_test_split(X, y, test_size = 0.05, random_state = 42) 

## Question 1

In the cells below, implement a **regularized logistic regression classifier** by writing code for:
- the `sigmoid function`
- the `cost function` for regularized logistic regression
- the `gradient` of the regularized cost (used by the optimizer in Question 2)

Before implementing these functions, first print the shape of the features (X) and target (y), and the number of training examples. 

At the end, test your implementation using weights `w` initialized to zeros and a `lambda_` value of 1.
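
For reference, the three quantities listed above are commonly written as follows. This is a sketch under the usual conventions, not a formula given in the exam text: it assumes $m$ training examples $x^{(i)}$ with labels $y^{(i)} \in \{0, 1\}$, $n$ features, a constant feature $x_0^{(i)} = 1$ for the intercept, and an intercept weight $w_0$ that is left out of the penalty.

$$
g(z) = \frac{1}{1 + e^{-z}}, \qquad h_w(x) = g(w^\top x)
$$

$$
J(w) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_w(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_w(x^{(i)})\right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} w_j^2
$$

$$
\frac{\partial J}{\partial w_0} = \frac{1}{m} \sum_{i=1}^{m} \left( h_w(x^{(i)}) - y^{(i)} \right) x_0^{(i)}, \qquad
\frac{\partial J}{\partial w_j} = \frac{1}{m} \sum_{i=1}^{m} \left( h_w(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m} w_j \quad (j \ge 1)
$$

With $w = 0$ every predicted probability is $0.5$ and the penalty vanishes, so $J = \log 2 \approx 0.693$ for any value of $\lambda$.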

In [99]:
# Print shape of X and y arrays
# ====================== YOUR CODE HERE ==================
print(X.shape , y.shape)

# Print the number of training examples
# ====================== YOUR CODE HERE ==================
print(y.size)

(540, 30) (540,)
540


In [100]:
def sigmoid(z):
    """
    Compute sigmoid function given the input z
    """
    # Return correct value of g
    g = np.zeros(z.shape)
    
    # ====================== YOUR CODE HERE ======================
    
    # np.exp works element-wise, so z may be a scalar-like array, vector or matrix
    g = 1 / (1 + np.exp(-z))

    # =============================================================
    return g

In [101]:
def costFunctionReg(w, X, y, lambda_):
    """
    Compute cost for regularized logistic regression. 
     
    """
    m = y.size  # number of training examples

    # You need to return the following variables correctly 
    J = 0
    # ===================== YOUR CODE HERE ======================
    h = sigmoid(np.dot(X, w))
    # Average cross-entropy over the m training examples
    J = -(np.dot(y, np.log(h)) + np.dot(1 - y, np.log(1 - h))) / m
    # L2 penalty; the intercept w[0] is not regularized
    J += (lambda_ / (2 * m)) * np.sum(np.square(w[1:]))

    # =============================================================
    return J

In [102]:
def gradientReg(w, X, y, lambda_):
    """
    Compute gradient for regularized logistic regression. 
    
    """
    grad = np.zeros(w.shape)

    # ===================== YOUR CODE HERE ======================
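    m = y.size  # number of training examples (do not rely on a global m)
    h = sigmoid(np.dot(X, w))
    # Gradient of the cross-entropy term for every weight at once
    grad = np.dot(X.T, h - y) / m
    # Add the L2 term for every weight except the intercept w[0]
    grad[1:] += (lambda_ / m) * w[1:]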
    
    
    # =============================================================
    return grad

In [103]:
# Setup the data matrix appropriately, and add ones for the intercept term
m, n = X.shape

# Add intercept term to X
X = np.concatenate([np.ones((m, 1)), X], axis=1)

In [104]:
# Test your implementation by computing cost for initialized weights of zeros and lambda_= 1

# Set regularization parameter
lambda_ = 1

# Initialize fitting parameters
initial_w = np.zeros(n + 1)

cost = costFunctionReg(initial_w, X, y, lambda_)
grad = gradientReg(initial_w, X, y, lambda_)

print('Cost at initial w (zeros): {:.3f}'.format(cost))
print('Expected cost (approx): 0.693\n')

print('Gradient at initial w (zeros) - first five values only:')
print('\t[{:.4f}, {:.4f}, {:.4f}, {:.4f}, {:.4f}]'.format(*grad[:5]))

print('Expected gradients (approx) - first five values only:')
print('\t[-0.1278, -0.5714, -1.5977, -3.1023, 35.9895]\n')

Cost at initial w (zeros): 0.693
Expected cost (approx): 0.693

Gradient at initial w (zeros) - first five values only:
	[-0.1278, -0.5714, -1.5977, -3.1023, 35.9895]
Expected gradients (approx) - first five values only:
	[-0.1278, -0.5714, -1.5977, -3.1023, 35.9895]
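
Besides comparing against the expected values, an analytic gradient can be checked against a finite-difference approximation of the cost. The sketch below is a minimal illustration on made-up random data, not on the breast cancer data; the names `cost_reg`, `grad_reg`, `numerical_grad` and the `*_demo` arrays are hypothetical and only mirror the structure of `costFunctionReg` and `gradientReg`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_reg(w, X, y, lambda_):
    # Regularized cross-entropy; the intercept w[0] is not penalized
    m = y.size
    h = sigmoid(X @ w)
    data_term = -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m
    penalty = (lambda_ / (2 * m)) * np.sum(w[1:] ** 2)
    return data_term + penalty

def grad_reg(w, X, y, lambda_):
    # Analytic gradient of cost_reg
    m = y.size
    grad = X.T @ (sigmoid(X @ w) - y) / m
    grad[1:] += (lambda_ / m) * w[1:]
    return grad

def numerical_grad(f, w, eps=1e-6):
    # Central differences, one coordinate at a time
    num = np.zeros_like(w)
    for j in range(w.size):
        step = np.zeros_like(w)
        step[j] = eps
        num[j] = (f(w + step) - f(w - step)) / (2 * eps)
    return num

# Made-up data: 20 examples, intercept column plus 3 random features
rng = np.random.default_rng(0)
X_demo = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 3))])
y_demo = (rng.random(20) > 0.5).astype(float)
w_demo = rng.normal(scale=0.1, size=4)

analytic = grad_reg(w_demo, X_demo, y_demo, 1.0)
numeric = numerical_grad(lambda w: cost_reg(w, X_demo, y_demo, 1.0), w_demo)
max_diff = np.max(np.abs(analytic - numeric))
print('Largest difference between analytic and numeric gradient:', max_diff)
```

A difference many orders of magnitude smaller than the gradient entries themselves suggests that the cost and the gradient agree with each other; it does not by itself show that the cost is the intended one.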



## Question 2

#### The next cell implements code for learning the weights. Execute the cell and evaluate the quality of the learned weights. Notice that `lambda_` is set to 100 in that cell.

#### You will evaluate in two ways:
1. ##### using the `predict` function on the test set (x_test) and printing the predicted results and the actual values (y_test).  
The `predict` function should produce “1” or “0” predictions given a dataset and a learned parameter vector  `w`.


2. ##### checking and printing the accuracy of the predictions on the training dataset
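
The two evaluation steps above can be sketched on a handful of made-up numbers. This is an illustration only: `predict_labels`, `accuracy_percent` and the `*_demo` arrays are hypothetical names and values, and the first column of `X_demo` is assumed to be the intercept column of ones.

```python
import numpy as np

def predict_labels(w, X, threshold=0.5):
    """Hypothetical helper: 0/1 labels from weights w and design matrix X."""
    probs = 1.0 / (1.0 + np.exp(-(X @ w)))
    # A probability of at least 0.5 corresponds to X @ w >= 0
    return (probs >= threshold).astype(int)

def accuracy_percent(pred, actual):
    # Share of predictions that match the actual labels, in percent
    return np.mean(pred == actual) * 100

# Made-up data: intercept column plus one feature
X_demo = np.array([[1.0, 2.0],
                   [1.0, -1.0],
                   [1.0, 0.5],
                   [1.0, -3.0]])
w_demo = np.array([0.0, 1.0])      # intercept 0, slope 1
y_demo = np.array([1, 0, 0, 0])    # third example is deliberately mislabeled by the model

pred_demo = predict_labels(w_demo, X_demo)
acc_demo = accuracy_percent(pred_demo, y_demo)
print(pred_demo, acc_demo)
```

The same accuracy computation gives the training accuracy when it is applied to the training arrays and the test accuracy when it is applied to the held-out arrays, so it matters which pair of arrays is passed in.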


In [105]:
## You are not required to make changes
from scipy.optimize import minimize

# Initialize fitting parameters
m, n = X.shape

initial_w = np.zeros(X.shape[1])

# Set regularization parameter
lambda_ = 100

res = minimize(costFunctionReg, 
               initial_w, 
               (X,y, lambda_), 
               method='TNC', 
               jac=gradientReg, 
               options={'maxiter':100})

# the fun property of `OptimizeResult` object returns
# the value of costFunction at optimized w
cost = res.fun

# the optimized w is in the x property
w = res.x

# Print the cost and the first three weights
print('Cost at w found by optimize.minimize: {:.3f}'.format(cost))
print('Expected cost (approx): 0.684\n')

print('w:')
print('\t[{:.3f}, {:.3f}, {:.3f}]'.format(*w[:3]))


Cost at w found by optimize.minimize: 0.124
Expected cost (approx): 0.684

w:
	[24.107, 0.001, -0.016]


In [106]:
def predict(w, X):
    """
    Predict whether the label is 0 or 1 using learned logistic regression.
    Computes the predictions for X using a threshold at 0.5 
    (i.e., if sigmoid(w.T*x) >= 0.5, predict 1)
    
    Instructions
    ------------
    Complete the following code to make predictions using your learned 
    logistic regression parameters. You should set p to a vector of 0's and 1's.
    """
    
    m, n = X.shape # Number of training examples
    
    # You need to return the following variables correctly
    p = np.empty(m) # Predictions (0 or 1) for each row in X

    # ====================== YOUR CODE HERE ======================
    # Threshold the predicted probabilities at 0.5; keep floats as in the template
    p = (sigmoid(np.dot(X, w)) >= 0.5).astype(float)

    # ============================================================
    return p

In [107]:
## You are not required to make changes
# Add intercept term to test set data (x_test)
x_test = np.concatenate([np.ones((x_test.shape[0], 1)), x_test], axis=1)

In [108]:
# Print probabilities and predictions on the test set. For predictions, use the predict function above

# ====================== YOUR CODE HERE ======================
predicted_values_test = predict(w,x_test)
print('Predicted values: \n', predicted_values_test)

# ============================================================

print('Actual values: \n', y_test)


Predicted values: 
 [1. 0. 0. 1. 1. 0. 0. 0. 1. 1. 1. 0. 1. 0. 1. 0. 1. 1. 1. 0. 1. 1. 0. 1.
 1. 1. 1. 1. 1.]
Actual values: 
 [1 0 0 1 1 0 0 0 1 1 1 0 1 0 1 0 1 1 1 0 0 1 0 1 1 1 1 1 1]


In [111]:
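#  Compute accuracy
# ====================== YOUR CODE HERE ======================
# Note: predicted_values_test and y_test come from the held-out test set, so this is
# the test accuracy. The training accuracy would be np.mean(predict(w, X) == y) * 100.
print('Test Accuracy: ', (np.mean(predicted_values_test == y_test) * 100))

Test Accuracy:  96.55172413793103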


## Question 3

Change the initial value of `lambda_` to 0. Has the new value of Train Accuracy improved? Briefly explain why.
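
The general effect of `lambda_` can be explored on synthetic data before rerunning the notebook. The sketch below is not an answer for the breast cancer data: it uses made-up data, plain batch gradient descent instead of `optimize.minimize`, and hypothetical names (`fit_gd`, `data_loss`, the `*_demo` arrays). It only illustrates that a large penalty shrinks the non-intercept weights and, in exchange, fits the training data less closely.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_gd(X, y, lambda_, lr=0.1, n_iter=2000):
    """Batch gradient descent on the regularized cost (intercept not penalized)."""
    m = y.size
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (sigmoid(X @ w) - y) / m
        grad[1:] += (lambda_ / m) * w[1:]
        w -= lr * grad
    return w

def data_loss(w, X, y):
    # Unpenalized cross-entropy on the given data
    h = sigmoid(X @ w)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

# Made-up data: 200 examples, intercept column plus 2 features, noisy labels
rng = np.random.default_rng(0)
m_demo = 200
X_demo = np.hstack([np.ones((m_demo, 1)), rng.normal(size=(m_demo, 2))])
true_w = np.array([0.5, 2.0, -1.0])
y_demo = (rng.random(m_demo) < sigmoid(X_demo @ true_w)).astype(float)

w_unreg = fit_gd(X_demo, y_demo, lambda_=0.0)
w_reg = fit_gd(X_demo, y_demo, lambda_=100.0)

norm_unreg = np.linalg.norm(w_unreg[1:])
norm_reg = np.linalg.norm(w_reg[1:])
loss_unreg = data_loss(w_unreg, X_demo, y_demo)
loss_reg = data_loss(w_reg, X_demo, y_demo)

print('lambda_ = 0:   weight norm', norm_unreg, 'training loss', loss_unreg)
print('lambda_ = 100: weight norm', norm_reg, 'training loss', loss_reg)
```

A lower training loss does not guarantee a higher training accuracy, and neither says how the model behaves on unseen data, so the comparison for this exam still has to be made by rerunning the cells with the new `lambda_`.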

## Submit the solution

When you have completed the exercise, download this file (from the File menu) as a Jupyter Notebook file (.ipynb) and upload it to Canvas.

*By writing down my name I declare that I have done the assignments myself.

* First Name  Last Name: