In [1]:
# import pandas for csv data loading
import pandas as pd
import numpy as np
import sklearn
import sklearn.preprocessing as pre

### Task 1 Linear regression

We will use the Computer Hardware Data Set for this task: https://archive.ics.uci.edu/ml/datasets/Computer+Hardware

![](pics/cpu.jpg)

In [2]:
pd.read_csv('data/cpu_performance/machine.data').head()

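| | vendor | name | MYCT | MMIN | MMAX | CACH | CHMIN | CHMAX | PRP | ERP |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | adviser | 32/60 | 125 | 256 | 6000 | 256 | 16 | 128 | 198 | 199 |
| 1 | amdahl | 470v/7 | 29 | 8000 | 32000 | 32 | 8 | 32 | 269 | 253 |
| 2 | amdahl | 470v/7a | 29 | 8000 | 32000 | 32 | 8 | 32 | 220 | 253 |
| 3 | amdahl | 470v/7b | 29 | 8000 | 32000 | 32 | 8 | 32 | 172 | 253 |
| 4 | amdahl | 470v/7c | 29 | 8000 | 16000 | 32 | 8 | 16 | 132 | 132 |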


This dataset has 10 attributes:

- vendor: name of the vendor
- name: model name (many unique symbols)
- MYCT: machine cycle time in nanoseconds (integer)
- MMIN: minimum main memory in kilobytes (integer)
- MMAX: maximum main memory in kilobytes (integer)
- CACH: cache memory in kilobytes (integer)
- CHMIN: minimum channels in units (integer)
- CHMAX: maximum channels in units (integer)
- PRP: published relative performance (integer)
- ERP: estimated relative performance from the original article (integer)

The task is to use the first 8 attributes to predict the 9th attribute, PRP (the published relative performance).

You will need to build a linear regression model for this task. Using the MSE (mean squared error) loss function, you are expected to get a loss lower than 6000 on your test set.
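For reference, the MSE loss named above can be computed as in the minimal sketch below. The helper name `mse` and the toy numbers are illustrative assumptions and are not part of the assignment.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_pred - y_true) ** 2)

# toy values chosen only for illustration (residuals are 10, -10 and 30)
example = mse([100, 200, 300], [110, 190, 330])
```

Note that the loss is measured in squared PRP units, so it should be computed on predictions mapped back to the original scale if the target was standardized.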

In [3]:
def Linear_regress(x_train, y_train, iters=10000, step=0.001):
    # batch gradient descent on the (half) mean squared error
    theta = np.zeros(x_train.shape[1])
    for i in range(iters):
        h = np.dot(x_train, theta)  # current predictions
        gradient = np.dot(h - y_train, x_train) / y_train.size  # gradient averaged over samples
        theta = theta - step * gradient
    return theta
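One way to sanity-check a gradient-descent routine like the one above is to compare it with a closed-form least-squares fit on synthetic data. The sketch below is an assumption-laden illustration: `linear_regress` is a local re-statement of the same update, and the data, step size, and iteration count are made up for the check.

```python
import numpy as np

def linear_regress(x, y, iters=10000, step=0.001):
    # same batch update as the cell above, restated so this block runs on its own
    theta = np.zeros(x.shape[1])
    for _ in range(iters):
        gradient = np.dot(np.dot(x, theta) - y, x) / y.size
        theta = theta - step * gradient
    return theta

# synthetic data: a constant column plus three standard-normal features
rng = np.random.default_rng(0)
x = np.concatenate((np.ones((200, 1)), rng.normal(size=(200, 3))), axis=1)
true_theta = np.array([0.5, 2.0, -1.0, 0.3])
y = x @ true_theta + 0.01 * rng.normal(size=200)

theta_gd = linear_regress(x, y, iters=5000, step=0.05)
theta_ls, *_ = np.linalg.lstsq(x, y, rcond=None)  # closed-form least squares
max_gap = np.max(np.abs(theta_gd - theta_ls))
```

On well-scaled features the two solutions should agree closely; a large `max_gap` usually points to a step size that is too small or too few iterations.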

In [10]:
df = pd.read_csv('data/cpu_performance/machine.data')
# x1: categorical columns (vendor, name); x2: numeric features; y: PRP
x1, x2, y = np.array(df.iloc[:,0:2]), np.array(df.iloc[:,2:-2]), np.array(df.iloc[:,-2])
# one-hot encode the categorical columns
enc = pre.OneHotEncoder()
x1 = enc.fit_transform(x1).toarray()
# standardize the numeric features and the target to zero mean and unit variance
x2_mean, y_mean = np.mean(x2,axis = 0), np.mean(y,axis = 0)
x2_std, y_std = np.std(x2,axis = 0), np.std(y,axis = 0)
x2,y = (x2 - x2_mean) / x2_std, (y - y_mean) / y_std
x = np.concatenate((x1, x2), axis=1)
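The standardization above uses statistics from the full dataset. A variant that computes the mean and standard deviation from the training rows only, so that no test information enters the preprocessing, could look like the sketch below. The synthetic array and the shuffled-index split are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=50.0, scale=10.0, size=(100, 3))  # synthetic numeric features

# hypothetical 70/30 split via a shuffled index
idx = rng.permutation(x.shape[0])
train_idx, test_idx = idx[:70], idx[70:]

# statistics come from the training rows only
mean = x[train_idx].mean(axis=0)
std = x[train_idx].std(axis=0)

x_train = (x[train_idx] - mean) / std
x_test = (x[test_idx] - mean) / std  # reuse the training statistics
```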

In [11]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 60)

In [12]:
# add constant column
x_train = np.concatenate((np.ones((x_train.shape[0], 1)), x_train), axis=1)
x_test = np.concatenate((np.ones((x_test.shape[0], 1)), x_test), axis=1)

In [13]:
theta = Linear_regress(x_train,y_train)
h_test = np.dot(x_test, theta)
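# undo the target standardization so the loss is in original PRP units
h_test_t = h_test * y_std + y_mean
y_test_t = y_test * y_std + y_mean
# half mean squared error; the plain MSE is twice this value
cost = ((h_test_t - y_test_t) ** 2 / 2).mean()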
cost

1593.9033341468166

## Linear regression using sklearn

In [14]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(x_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [15]:
h_test = model.predict(x_test)
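# undo the target standardization so the loss is in original PRP units
h_test_t = h_test * y_std + y_mean
y_test_t = y_test * y_std + y_mean
# half mean squared error; the plain MSE is twice this value
cost = ((h_test_t - y_test_t) ** 2 / 2).mean()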
cost

1948.8204269476373

### Task 2 Logistic regression


The dataset we will use is the Glass Identification dataset from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Glass+Identification
![](pics/glass.jpg)

In [10]:
pd.read_csv('data/glass_ident/glass.data').head()

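| | id | RI | Na | Mg | Al | Si | K | Ca | Ba | Fe | class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1.52101 | 13.64 | 4.49 | 1.1 | 71.78 | 0.06 | 8.75 | 0.0 | 0.0 | 1 |
| 1 | 2 | 1.51761 | 13.89 | 3.6 | 1.36 | 72.73 | 0.48 | 7.83 | 0.0 | 0.0 | 1 |
| 2 | 3 | 1.51618 | 13.53 | 3.55 | 1.54 | 72.99 | 0.39 | 7.78 | 0.0 | 0.0 | 1 |
| 3 | 4 | 1.51766 | 13.21 | 3.69 | 1.29 | 72.61 | 0.57 | 8.22 | 0.0 | 0.0 | 1 |
| 4 | 5 | 1.51742 | 13.27 | 3.62 | 1.24 | 73.08 | 0.55 | 8.07 | 0.0 | 0.0 | 1 |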


This dataset has 10 attributes:

- id: a number identifying the specific data point
- RI: refractive index
- Na: Sodium (unit measurement: weight percent in corresponding oxide, as are attributes 4-10)
- Mg: Magnesium
- Al: Aluminum
- Si: Silicon
- K: Potassium
- Ca: Calcium
- Ba: Barium
- Fe: Iron

This dataset has 6 classes: 1, 2, 3, 5, 6, and 7. **Note that there is no class 4 in this dataset.**

- 1 building_windows_float_processed
- 2 building_windows_non_float_processed
- 3 vehicle_windows_float_processed
- 4 vehicle_windows_non_float_processed (None in this dataset)
- 5 containers
- 6 tableware
- 7 headlamps

You have two sub-tasks on this dataset:

1. Build a logistic regression model to classify class 2 versus not class 2, i.e. a binary classifier that separates class 2 from everything else. This binary classifier should be able to reach an accuracy higher than 85%.
2. Build a multiclass classification model by building 6 binary classifiers. This multiclass classifier should be able to reach an accuracy higher than 50%.

## Grading Policy

You can use a high-level library (PyTorch, TensorFlow, sklearn) to complete the tasks, but there will be a **Bonus** if you use **Numpy** to implement the algorithms from scratch.

### Forward pass, compute classifier output and cross entropy loss

compute $h_{\theta}(x)$
$$
h_{\theta}(x)=\frac{1}{1+e^{-\theta^T x}}
$$

compute $J(\theta)$

$$
J(\theta)=\frac{1}{m}\sum_{i=1}^{m}Cost(h_{\theta}(x^{(i)}),y^{(i)})
$$

compute $Cost(h_{\theta}, y)$ (cross entropy)

$$
Cost(h_{\theta}(x), y)=-y\log(h_{\theta}(x))-(1-y)\log(1-h_{\theta}(x))
$$
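The forward pass can be sketched in NumPy as follows. The clipping constant `eps` is an assumption added here to keep `log` finite when a prediction saturates at exactly 0 or 1; it is not part of the formulas above, and the toy inputs are illustrative.

```python
import numpy as np

def sigmoid(z):
    """Logistic function h = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(h, y, eps=1e-12):
    """Mean binary cross entropy; eps is an assumed clipping constant."""
    h = np.clip(h, eps, 1.0 - eps)
    return np.mean(-y * np.log(h) - (1 - y) * np.log(1 - h))

# toy inputs: with theta = 0 every prediction is 0.5, so the loss is log(2)
theta = np.zeros(2)
x = np.array([[1.0, 2.0], [1.0, -1.0]])
y = np.array([1.0, 0.0])
loss_at_zero = cross_entropy(sigmoid(x @ theta), y)
```

Starting from `theta = 0`, a loss near `log(2)` on the first iteration is a quick sign that the forward pass is wired correctly.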

### Backward pass, compute gradients and update the classifier's weight

compute the gradient
$$
\frac{\partial J(\theta)}{\partial \theta}=\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x^{(i)}
$$
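A finite-difference check is a common way to confirm that an analytic gradient matches the loss it is supposed to differentiate. The sketch below compares the gradient of $J(\theta)$ as defined above (including its $\frac{1}{m}$ factor) with central differences; the function names, the random data, and the step `eps` are assumptions for illustration.

```python
import numpy as np

def loss(theta, x, y):
    # mean cross entropy J(theta)
    h = 1.0 / (1.0 + np.exp(-x @ theta))
    return np.mean(-y * np.log(h) - (1 - y) * np.log(1 - h))

def analytic_grad(theta, x, y):
    # (1/m) * sum_i (h(x_i) - y_i) * x_i
    h = 1.0 / (1.0 + np.exp(-x @ theta))
    return (h - y) @ x / y.size

rng = np.random.default_rng(1)
x = rng.normal(size=(20, 3))
y = (rng.random(20) > 0.5).astype(float)
theta = 0.1 * rng.normal(size=3)

# central differences, one coordinate at a time
eps = 1e-6
numeric = np.zeros_like(theta)
for j in range(theta.size):
    d = np.zeros_like(theta)
    d[j] = eps
    numeric[j] = (loss(theta + d, x, y) - loss(theta - d, x, y)) / (2 * eps)

max_diff = np.max(np.abs(numeric - analytic_grad(theta, x, y)))
```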

update the weights
$$
\theta_{j}^{new}=\theta_{j}^{old}-\alpha\frac{\partial J(\theta)}{\partial \theta_{j}}
$$


In [11]:
def binary_class(x_train, y_train,iters = 1000, step = 0.01):
    theta = np.zeros(x_train.shape[1])
    for i in range(iters):
        # forward
        h = 1 / (1 + np.exp(-np.dot(x_train, theta)))
        cost = (-y_train * np.log(h)-(1 - y_train) * np.log(1 - h)).mean()
        # backward
        gradient = np.dot(h - y_train, x_train)/y_train.size
        theta = theta - step * gradient
        # display
        #if i % 50 == 0:
            #print('Iters', i, 'cost:', cost) 
    return theta

## Binary classification 

In [12]:
df = pd.read_csv('data/glass_ident/glass.data')

In [13]:
x, y = np.array(df.iloc[:,1:-1]), np.array(df.iloc[:,-1])

In [14]:
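# binarize the labels: class 3 becomes 1, every other class becomes 0
# (the sub-task statement targets class 2; this cell and the accuracy below use class 3)
y[y != 3] = 0
y[y == 3] = 1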

In [15]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 60)

In [16]:
x_train_mean,x_train_std = np.mean(x_train,axis = 0),np.std(x_train,axis = 0)

In [17]:
x_train = (x_train - x_train_mean) / x_train_std
x_test = (x_test - x_train_mean) / x_train_std
# add constant column
x_train = np.concatenate((np.ones((x_train.shape[0], 1)), x_train), axis=1)
x_test = np.concatenate((np.ones((x_test.shape[0], 1)), x_test), axis=1)

In [18]:
theta1 = binary_class(x_train, y_train, 1000, 0.05)

In [19]:
h_test = 1 / (1 + np.exp(-np.dot(x_test, theta1)))
((h_test > 0.5) == y_test).sum() / y_test.size

0.9384615384615385
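When the positive class is rare, accuracy alone can look high even for a classifier that never predicts it. A quick way to put the number in context is to compare it with the majority-class baseline and to look at the confusion counts. The helper names and toy labels below are illustrative assumptions and are not taken from the glass data.

```python
import numpy as np

def baseline_accuracy(y):
    """Accuracy of always predicting the most frequent label."""
    _, counts = np.unique(y, return_counts=True)
    return counts.max() / y.size

def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for 0/1 labels."""
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))
    return tp, fp, fn, tn

# toy labels: 8 negatives, 2 positives
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 0])
base = baseline_accuracy(y_true)
counts = confusion_counts(y_true, y_pred)
```

In the notebook, the same helpers could be applied to `y_test` and `(h_test > 0.5).astype(int)`.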

## Binary classification using sklearn

In [20]:
from sklearn.linear_model import LogisticRegression
binary_model = LogisticRegression(random_state=0, solver='lbfgs')

In [21]:
# train
binary_model.fit(x_train,y_train) 
# predict
h_test = binary_model.predict_proba(x_test)
h_test = h_test[:, 1]
((h_test > 0.5) == y_test).sum() / y_test.size
#print(binary_model.score(x_test,y_test))

0.9230769230769231

## Multi-class classification

In [22]:
def multi_class(x_train, y_train):
    # class labels 1..7; class 4 has no samples, so its classifier is trained on all-negative labels
    num_class = list(range(1,8))
    param = np.zeros((len(num_class), x_train.shape[1]))
    
    for i,line in enumerate(num_class):
        label_t = np.zeros_like(y_train)
        label_t[y_train == line] = 1
        param[i, :] = binary_class(x_train, label_t,10000,0.0001)
    
    return param

In [23]:
df = pd.read_csv('data/glass_ident/glass.data')
x, y = np.array(df.iloc[:,1:-1]), np.array(df.iloc[:,-1])

In [24]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 60)

In [25]:
x_train_mean,x_train_std = np.mean(x_train,axis = 0),np.std(x_train,axis = 0)

x_train = (x_train - x_train_mean) / x_train_std
x_test = (x_test - x_train_mean) / x_train_std

x_train = np.concatenate((np.ones((x_train.shape[0], 1)), x_train), axis=1)
x_test = np.concatenate((np.ones((x_test.shape[0], 1)), x_test), axis=1)

In [26]:
params = multi_class(x_train, y_train)

In [27]:
def multi_pred(param, x_test, y_test):
    logits = np.dot(x_test, np.transpose(param)).squeeze()
    prob = 1 / (1 + np.exp(-logits))
    pred = np.argmax(prob, axis=1) + 1
    accuracy = np.sum(pred == y_test) / y_test.shape[0] * 100
    return prob, pred, accuracy
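The one-vs-rest decision rule used here picks, for each sample, the class whose binary classifier gives the highest score. A small sketch with made-up scores shows the idea; indexing an explicit label array is an alternative (assumed here, not used by the cell above) to keeping a placeholder column for the missing class 4.

```python
import numpy as np

# labels present in the glass data (there is no class 4)
classes = np.array([1, 2, 3, 5, 6, 7])

# hypothetical scores from six binary classifiers, one row per sample
scores = np.array([
    [0.7, 0.2, 0.1, 0.0, 0.0, 0.1],
    [0.1, 0.3, 0.1, 0.6, 0.0, 0.2],
    [0.2, 0.1, 0.0, 0.1, 0.1, 0.9],
])

# the column with the highest score decides the predicted label
pred = classes[np.argmax(scores, axis=1)]
```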

In [28]:
_, preds, accu = multi_pred(params, x_test, y_test)
print("Prediction: {}\n".format(preds))
print("Accuracy: {:.3f}%".format(accu))

Prediction: [1 1 2 2 1 7 2 1 1 1 1 2 1 5 1 7 7 2 1 7 2 1 7 2 1 2 2 1 1 1 2 1 1 7 7 1 2
 7 1 1 2 1 2 7 2 7 1 7 1 1 1 2 2 5 1 1 2 7 1 1 1 2 1 7 7]

Accuracy: 64.615%


## Multi-class classification using sklearn

In [29]:
from sklearn.linear_model import LogisticRegression
multi_model = LogisticRegression(random_state=0, solver='lbfgs', multi_class='multinomial').fit(x_train, y_train)

In [30]:
proba = multi_model.predict_proba(x_test)
# map argmax column indices back to the labels the model was fitted on (no class 4)
preds = multi_model.classes_[np.argmax(proba, axis=1)]
accu = np.sum(preds == y_test) / y_test.shape[0] * 100

In [31]:
print("Prediction: {}\n".format(preds))
print("Accuracy: {:.3f}%".format(accu))

Prediction: [1 2 2 2 1 7 2 1 1 1 2 2 1 2 1 7 7 2 1 7 1 1 7 1 1 1 2 1 2 2 2 2 1 7 2 1 2
 7 2 1 1 1 3 7 2 7 2 6 1 1 2 2 2 5 1 1 2 7 1 1 2 2 1 2 7]

Accuracy: 63.077%
