# Homework 2
## Hoeffding Inequality
Run a computer simulation for flipping 1,000 virtual fair coins. Flip each coin independently 10 times. Focus on 3 coins as follows: $c_1$ is the first coin flipped, $c_{rand}$ is a coin chosen randomly from the 1,000, and $c_{min}$ is the coin which had the minimum frequency of heads (pick the earlier one in case of a tie). Let $v_1$, $v_{rand}$, and $v_{min}$ be the fraction of heads obtained for the 3 respective coins out of the 10 tosses.  

Run the experiment 100,000 times in order to get a full distribution of $v_1$, $v_{rand}$, and $v_{min}$ (note that $c_{rand}$ and $c_{min}$ will change from run to run).
### Question 1
The average value of $v_{min}$ is closest to:

In [2]:
import numpy as np
import random

no_coins = 1000 # number of coins to flip
no_flips = 10 # number of times to flip one coin
no_trials = 100000 # number of times to repeat the experiment
v_1, v_rand, v_min = [], [], [] # where to save the values for each trial run

for t in range(no_trials):
    sample = np.random.randint(2, size=(no_coins, no_flips))
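    # Let 1's be heads and 0's tails
    heads = sample.sum(axis=1)/no_flips # fraction of heads per coin (sum over each row)
    
    v_1.append(heads[0]) # the first coin has index 0
    # choose the index of the random coin uniformly from all coins
    c_rand = random.randint(0, no_coins - 1) # randint includes both endpoints
    v_rand.append(heads[c_rand])
    v_min.append(heads[np.argmin(heads)]) # argmin returns the earliest index on ties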

# average value of v_min
v_min_avg = np.average(v_min)
print('The average value of v_min is closest to: ', 
      min([0, 0.01, 0.1, 0.5, 0.67], key=lambda x:abs(x-v_min_avg)),
     ' with v_min equal to ', v_min_avg)

The average value of v_min is closest to:  0.01  with v_min equal to  0.037624


Answer (b)
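
As a sanity check on the simulated average, the expected value of $v_{min}$ can also be computed exactly. The sketch below is an addition to the original solution: it applies the tail-sum identity $E[M]=\sum_{k\ge 1}P[M\ge k]$ to the minimum head count $M$ over independent fair coins, and the function name `expected_v_min` is hypothetical.

```python
from math import comb

def expected_v_min(no_coins=1000, no_flips=10):
    """Exact expected minimum fraction of heads over independent fair coins."""
    total = 2 ** no_flips
    expected_min_heads = 0.0
    for k in range(1, no_flips + 1):
        # P[a single coin shows at least k heads]
        p_at_least_k = sum(comb(no_flips, j) for j in range(k, no_flips + 1)) / total
        # The minimum is >= k only if every coin shows at least k heads
        expected_min_heads += p_at_least_k ** no_coins
    return expected_min_heads / no_flips

print(expected_v_min())
```

The result is approximately 0.0376, in line with the simulated value above.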

### Question 2  
Which coin(s) has a distribution of $v$ that satisfies the (single-bin) Hoeffding Inequality?


Answer: The single-bin Hoeffding Inequality bounds $P[|v-\mu|>\epsilon]$ for one coin that is fixed without looking at the outcomes of its flips. Every coin is fair, so $\mu=0.5$. Both $c_1$ and $c_{rand}$ are chosen independently of the outcomes, so $v_1$ and $v_{rand}$ are plain samples from a single bin, and the averages printed below confirm that both are close to 0.5 as expected. The coin $c_{min}$ is selected after the flips, precisely because its frequency of heads is the smallest, so $v_{min}$ is pushed far below 0.5 by construction and its distribution does not obey the single-bin bound.  

Answer is (d)

In [3]:
print('Average of v_rand is ', np.average(v_rand))
print('Average of v_1 is ', np.average(v_1))
print('Average of v_min is ', np.average(v_min))

Average of v_rand is  0.49915
Average of v_1 is  0.499095
Average of v_min is  0.037624
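
The averages alone do not test the inequality itself, so the following sketch (an addition, not part of the original solution) compares the empirical tail probability $P[|v-\mu|>\epsilon]$ with the Hoeffding bound $2e^{-2\epsilon^2 N}$ for each of the three coins. It assumes a smaller number of runs than above to stay quick, draws head counts directly from a binomial distribution, and the helper `hoeffding_check` is a hypothetical name.

```python
import numpy as np

def hoeffding_check(v, mu, n_flips, eps_values):
    """Return (eps, empirical tail probability, Hoeffding bound) triples."""
    rows = []
    for eps in eps_values:
        empirical = np.mean(np.abs(v - mu) > eps)
        bound = 2 * np.exp(-2 * eps**2 * n_flips)
        rows.append((eps, empirical, bound))
    return rows

rng = np.random.default_rng(0)
no_coins, no_flips, no_runs = 1000, 10, 2000  # fewer runs than above, to keep it quick

# Fraction of heads per coin, with head counts drawn from Binomial(10, 0.5)
heads = rng.binomial(no_flips, 0.5, size=(no_runs, no_coins)) / no_flips
v_1 = heads[:, 0]
v_rand = heads[np.arange(no_runs), rng.integers(0, no_coins, size=no_runs)]
v_min = heads.min(axis=1)

# eps values chosen between the attainable values of |v - 0.5| to avoid ties
eps_values = (0.15, 0.25, 0.35, 0.45)
for name, v in (('v_1', v_1), ('v_rand', v_rand), ('v_min', v_min)):
    for eps, empirical, bound in hoeffding_check(v, 0.5, no_flips, eps_values):
        print(f'{name}: eps={eps:.2f}  P[|v-mu|>eps]={empirical:.4f}  bound={bound:.4f}')
```

For $v_1$ and $v_{rand}$ the empirical probabilities should stay below the bound, while for $v_{min}$ they exceed it for the larger values of $\epsilon$.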


## Error and Noise
Consider the bin model for a hypothesis $h$ that makes an error with probability $\mu$ in
approximating a deterministic target function $f$ (both $h$ and $f$ are binary functions).
If we use the same $h$ to approximate a noisy version of $f$ given by:
$$P(y|\mathbf x) = \cases{
\lambda  & \text{if   } y=f(\mathbf x)\cr
(1-\lambda) & \text{if   } y \neq f(\mathbf x)
}$$

### Question 3
What is the probability of error that $h$ makes in approximating $y$? Hint: Two wrongs can make a right!

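Answer: The hypothesis $h$ makes an error in approximating $y$ in exactly two cases:
1. $h(x)=f(x)$ and $y\neq f(x)$, which happens with probability $(1-\mu)(1-\lambda)$
2. $h(x)\neq f(x)$ and $y=f(x)$, which happens with probability $\mu\lambda$  

When both go wrong, i.e. $h(x)\neq f(x)$ and $y\neq f(x)$, the two binary values coincide again, so $h(x)=y$ and no error occurs. Treating the noise as independent of whether $h$ errs and adding the two disjoint cases, we can write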
$$
P[h(x)\neq y] = (1-\mu)(1-\lambda) + \mu\lambda
$$
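
The formula can be checked numerically. The sketch below is an illustrative addition: it assumes the error of $h$ and the noise in $y$ are independent events, and the function name `simulate_error` and the values $\mu=0.2$, $\lambda=0.9$ are arbitrary choices.

```python
import numpy as np

def simulate_error(mu, lam, n=200_000, seed=0):
    """Monte Carlo estimate of P[h(x) != y] for binary h, f and noisy y."""
    rng = np.random.default_rng(seed)
    f = rng.choice([-1, 1], size=n)    # arbitrary target values
    h_wrong = rng.random(n) < mu       # h disagrees with f with probability mu
    y_flipped = rng.random(n) >= lam   # y disagrees with f with probability 1 - lam
    h = np.where(h_wrong, -f, f)
    y = np.where(y_flipped, -f, f)
    return np.mean(h != y)

mu, lam = 0.2, 0.9
print('simulated:', simulate_error(mu, lam))
print('formula:  ', (1 - mu) * (1 - lam) + mu * lam)
```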

### Question 4
At what value of $\lambda$ will the performance of $h$ be independent of $\mu$?

Answer: From the previous question we have
$$\begin{align}
P[h(x)\neq y] &= (1-\mu)(1-\lambda) + \mu\lambda \\
&=1-\lambda-\mu+\mu\lambda+\mu\lambda \\
&=1-\lambda+\mu(2\lambda-1)
\end{align}
$$

The only dependence on $\mu$ is through the term $\mu(2\lambda-1)$, which vanishes when $\lambda=\frac {1}{2}$. Hence, if $\lambda=\frac {1}{2}$, $P[h(x)\neq y]=\frac {1}{2}$ and becomes independent of $\mu$.
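
A quick evaluation makes the independence visible. This small sketch is an addition, with `error_prob` as a hypothetical helper name. It tabulates the error probability over several values of $\mu$ for $\lambda=0.5$ and, for contrast, $\lambda=0.9$.

```python
def error_prob(mu, lam):
    """P[h(x) != y] from Question 3."""
    return (1 - mu) * (1 - lam) + mu * lam

for mu in (0.0, 0.1, 0.5, 0.9, 1.0):
    # The lambda = 0.5 column stays flat; the lambda = 0.9 column grows with mu
    print(f'mu={mu:.1f}  lambda=0.5: {error_prob(mu, 0.5):.3f}  lambda=0.9: {error_prob(mu, 0.9):.3f}')
```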

## Linear Regression
In these problems, we will explore how Linear Regression for classification works. As with the Perceptron Learning Algorithm in Homework #1, you will create your own target function $f$ and data set $D$. Take $d = 2$ so you can visualize the problem, and assume $\mathcal X = [-1, 1] \times [-1, 1]$ with uniform probability of picking each $\mathbf x \in \mathcal X$. In each run, choose a random line in the plane as your target function $f$ (do this by taking two random, uniformly distributed points in $[-1, 1] \times [-1, 1]$ and taking the line passing through them), where one side of the line maps to +1 and the other maps to -1. Choose the inputs $x_n$ of the data set as random points (uniformly in $\mathcal X$), and evaluate the target function on each $x_n$ to get the corresponding output $y_n$.  
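
The construction of the target line can be sketched as follows. This is an illustrative addition: the names `random_target_line` and `make_dataset` are hypothetical, and the line through $(x_a, y_a)$ and $(x_b, y_b)$ is written as $l_0 + l_1 x_1 + l_2 x_2 = 0$ with $l = (x_b y_a - x_a y_b,\; y_b - y_a,\; x_a - x_b)$.

```python
import numpy as np

def random_target_line(rng):
    """Coefficients (l0, l1, l2) of the line through two random points, plus the points."""
    (xa, ya), (xb, yb) = rng.uniform(-1, 1, size=(2, 2))
    line = np.array([xb * ya - xa * yb, yb - ya, xa - xb])
    return line, (xa, ya), (xb, yb)

def make_dataset(line, n_points, rng):
    """Uniform points on [-1, 1]^2 with a leading 1 for the bias, labelled by the line."""
    X = np.column_stack((np.ones(n_points), rng.uniform(-1, 1, size=(n_points, 2))))
    y = np.where(X @ line >= 0, 1, -1)
    return X, y

rng = np.random.default_rng(0)
line, p_a, p_b = random_target_line(rng)
X, y = make_dataset(line, 100, rng)

# Both defining points should lie on the line, up to rounding error
print(line @ np.array([1.0, *p_a]), line @ np.array([1.0, *p_b]))
```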
### Question 5  

Take $N = 100$. Use Linear Regression to find $g$ and evaluate $E_{in}$, the fraction of in-sample points which got classified incorrectly. Repeat the experiment 1000 times and take the average (keep the $f$'s and $g$'s as they will be used again in Problem 6). Which of the following values is closest to the average $E_{in}$? (Closest is the option that makes the expression |your answer - given option| closest to 0. Use this definition of closest here and throughout.)

Answer: The process is as follows:
1. Generate the input data set and assign the target value to each point $n=1,...,N$. As specified in the question, we need to keep the $f$'s in order to reuse them later, so the code below uses the same **l** vector, which holds the coefficients of the line, i.e. our $f$, for both the in-sample and the out-of-sample points. 
2. Fit the weights with linear regression to obtain $g$ for each trial
3. Compute the $E_{in}$
4. Repeat for the number of trials needed and average the results
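
Steps 2 and 3 can be sketched in isolation. The sketch below is an alternative formulation, not the code used for the reported results: it solves the least-squares problem with `np.linalg.lstsq` rather than forming $(X^TX)^{-1}$ explicitly, and the target `true_w` and the helper names are made up for illustration.

```python
import numpy as np

def fit_linear_regression(X, y):
    """Least-squares weights without explicitly inverting X^T X."""
    w, *_ = np.linalg.lstsq(X, y.astype(float), rcond=None)
    return w

def classification_error(X, y, w):
    """Fraction of points whose sign(w . x) differs from the label."""
    return np.mean(np.where(X @ w >= 0, 1, -1) != y)

rng = np.random.default_rng(1)
true_w = np.array([0.1, 1.0, -1.0])  # hypothetical target line, for illustration only
X = np.column_stack((np.ones(100), rng.uniform(-1, 1, size=(100, 2))))
y = np.where(X @ true_w >= 0, 1, -1)

w = fit_linear_regression(X, y)
print('E_in =', classification_error(X, y, w))
```

For a well-conditioned $X$ the weights agree with the explicit normal-equation solution.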

In [4]:
# 2. Linear regression
def linregr(N, Nt, dim, no_trials):
    
    Ein_t = [] # save the in-sample error for each trial
    Eout_t = [] # save the out-of-sample error for each trial
    w = [] # store the weights
    
    for t in range(no_trials):
        
        # Pick two random points and build the target line through them
        xa,ya,xb,yb = [random.uniform(-1, 1) for i in range(4)]
        # Coefficients (l0, l1, l2) of the line l0 + l1*x1 + l2*x2 = 0
        l = np.array([xb*ya-xa*yb, yb-ya, xa-xb])
        l = l.reshape((l.shape[0],1))
        
        # Generate train data
        X = np.column_stack((np.ones(N), 
                             np.random.uniform(-1.,1., size=(N, dim))))
        Y = np.where(np.dot(X,l)>=0,1,-1)
        
        # Compute the weights
        w = np.dot(np.dot(np.linalg.inv(np.dot(X.T,X)),X.T),Y)
        
        # Estimate hypothesis
        g = np.where(np.dot(X,w)>=0, 1, -1)
        
        # Compute in-sample error
        Error = g - Y
        Ein = np.count_nonzero(Error)/N
        Ein_t.append(Ein)
        
        # Compute out-of-sample error
        Xt = np.column_stack((np.ones(Nt),
                              np.random.uniform(-1.,1., size=(Nt, dim))))
        Yt = np.where(np.dot(Xt,l)>=0,1,-1)
        
        gt = np.where(np.dot(Xt,w)>=0, 1, -1)
        Error_t = gt - Yt
        Eout = np.count_nonzero(Error_t)/Nt
        Eout_t.append(Eout)
        
    # Collect the results once all trials are done (w holds the last trial's weights)
    lr_out = {'w': w,
              'Ein_t': Ein_t,
              'Eout_t': Eout_t}
    
    return lr_out

In [5]:
# INPUTS
N = 100 # number of training examples
Nt = 1000 # number of test examples
dim = 2 # dimension of X
no_trials = 1000 # number of times to run the experiment

lreg_out = linregr(N, Nt, dim, no_trials)

print('The value closest to the E_in is: ', 
      min([0, 0.001, 0.01, 0.1, 0.5], key=lambda x:abs(x-np.average(lreg_out['Ein_t']))),
     ' with E_in equal to ', np.average(lreg_out['Ein_t']))

The value closest to the E_in is:  0.01  with E_in equal to  0.03885


Answer: (c)

### Question 6  
Now, generate 1000 fresh points and use them to estimate the out-of-sample error $E_{out}$ of the $g$'s that you got in Problem 5 (number of misclassified out-of-sample points / total number of out-of-sample points). Again, run the experiment 1000 times and take the average. Which value is closest to the average $E_{out}$?  

Answer: Using the code from Question 5 we have:

In [6]:
print('The value closest to the E_out is: ', 
      min([0, 0.001, 0.01, 0.1, 0.5], key=lambda x:abs(x-np.average(lreg_out['Eout_t']))),
     ' with E_out equal to ', np.average(lreg_out['Eout_t']))

The value closest to the E_out is:  0.01  with E_out equal to  0.049463


Answer: (c)

### Question 7  
Now, take $N = 10$. After finding the weights using Linear Regression, use them as a vector of initial weights for the Perceptron Learning Algorithm. Run PLA until it converges to a final vector of weights that completely separates all the in-sample points. Among the choices below, what is the closest value to the average number of iterations (over 1000 runs) that PLA takes to converge? (When implementing PLA, have the algorithm choose a point randomly from the set of misclassified points at each iteration.)

In [7]:
def perceptron(N, no_trials, learn_rate):
    
    iter_cnt = []
    for t in range(no_trials):
    
        # Generate random values for the line segment
        xa,ya,xb,yb = [random.uniform(-1, 1) for i in range(4)]
        l = np.array([xb*ya-xa*yb, yb-ya, xa-xb])
        l = l.reshape((l.shape[0],1))
        
        # Generate train data (relies on the global `dim` set in the inputs cell above)
        X = np.column_stack((np.ones(N), 
                             np.random.uniform(-1.,1., size=(N, dim))))
        Y = np.where(np.dot(X,l)>=0,1,-1)
        
        # Compute the weights using linear regression
        w = np.dot(np.dot(np.linalg.inv(np.dot(X.T,X)),X.T),Y)
        
        cnt_iter = 0 # initialize counter for the convergence
        while True:
            g = np.where(np.dot(X,w) >= 0, 1, -1)
            err = g - Y
            misclass = np.nonzero(err)[0]
            if len(misclass) == 0:
                break
            else:
                point = random.choice(misclass) # choose a random point from the vector of misclassifications
                update = np.reshape(learn_rate*Y[point]*X[point],(len(w),1))
                w = w + update # update weights
                cnt_iter += 1
        iter_cnt.append(cnt_iter)
        
    return iter_cnt

In [8]:
# Inputs
N = 10
no_trials = 1000
learn_rate = 0.01

iter_cnt = perceptron(N, no_trials, learn_rate)

print('The closest value to the average number of iterations (over 1000 runs) that the PLA takes to converge is : ', 
      min([1, 15, 300, 5000, 10000], key=lambda x:abs(x-np.average(iter_cnt))),
     ' with average value equal to ', np.average(iter_cnt))

The closest value to the average number of iterations (over 1000 runs) that the PLA takes to converge is :  1  with average value equal to  6.094


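Answer: (a). Few updates are needed, as expected with such a small $N$ and with PLA starting from the linear regression weights rather than from scratch.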
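
To see how much the initial weights matter, the sketch below (an addition, not the code used for the reported result) runs PLA twice on the same data: once from zero weights and once from the linear regression weights. It assumes a unit learning rate, which differs from the `learn_rate = 0.01` used above, fewer runs, and a hypothetical `max_iter` safety cap.

```python
import numpy as np

def pla(X, y, w_init, rng, max_iter=100_000):
    """Run PLA with unit learning rate; return final weights and number of updates."""
    w = np.asarray(w_init, dtype=float).copy()
    for n_updates in range(max_iter):
        misclassified = np.flatnonzero(np.where(X @ w >= 0, 1, -1) != y)
        if misclassified.size == 0:
            return w, n_updates
        i = rng.choice(misclassified)  # pick a random misclassified point
        w += y[i] * X[i]
    return w, max_iter

rng = np.random.default_rng(0)
N, n_runs = 10, 200
updates_zero, updates_lr = [], []
for _ in range(n_runs):
    # Random target line through two points, as in the problem statement
    (xa, ya), (xb, yb) = rng.uniform(-1, 1, size=(2, 2))
    line = np.array([xb * ya - xa * yb, yb - ya, xa - xb])
    X = np.column_stack((np.ones(N), rng.uniform(-1, 1, size=(N, 2))))
    y = np.where(X @ line >= 0, 1, -1)

    w_lr = np.linalg.pinv(X) @ y  # linear regression weights
    updates_zero.append(pla(X, y, np.zeros(3), rng)[1])
    w_final, n_updates = pla(X, y, w_lr, rng)
    updates_lr.append(n_updates)

print('average updates from zero weights:      ', np.mean(updates_zero))
print('average updates from regression weights:', np.mean(updates_lr))
```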

## Nonlinear Transformation  
In these problems, we again apply Linear Regression for classification. Consider the target function:  

$$f(x_1, x_2) = \mathrm{sign}(x^2_1 + x^2_2 - 0.6)$$  

Generate a training set of $N = 1000$ points on $\mathcal X = [-1, 1] \times [-1, 1]$ with a uniform probability of picking each $\mathbf x \in \mathcal X$. Generate simulated noise by flipping the sign of the output in a randomly selected 10% subset of the generated training set.  
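
The data generation with noise can be sketched on its own. This is an illustrative addition with a hypothetical helper name `noisy_circle_data`. It draws the flipped indices without replacement so that exactly 10% of the labels change.

```python
import numpy as np

def noisy_circle_data(n_points, rng, noise_fraction=0.10):
    """Uniform points on [-1, 1]^2 labelled by f, with a random subset of labels flipped."""
    X = rng.uniform(-1, 1, size=(n_points, 2))
    y_clean = np.where(X[:, 0]**2 + X[:, 1]**2 - 0.6 >= 0, 1, -1)
    n_flip = int(round(noise_fraction * n_points))
    flip_idx = rng.choice(n_points, size=n_flip, replace=False)  # distinct indices
    y = y_clean.copy()
    y[flip_idx] *= -1
    return X, y, y_clean

rng = np.random.default_rng(0)
X, y, y_clean = noisy_circle_data(1000, rng)
print('number of flipped labels:', np.sum(y != y_clean))
```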
### Question 8  
Carry out Linear Regression without transformation, i.e., with feature vector:
$$(1, x_1, x_2)$$
to find the weight $w$. What is the closest value to the classification in-sample error $E_{in}$? (Run the experiment 1000 times and take the average $E_{in}$ to reduce variation in your results.)

In [9]:
# Linear regression on the noisy nonlinear target (no feature transformation)
def nonlinregr(N, Nt, dim, no_trials):
    
    nEin_t = [] # save the in-sample error for each trial
    nEout_t = [] # save the out-of-sample error for each trial
    nw = [] # weights of the most recent trial
    pnoise = 10 # percent of noise

    
    for t in range(no_trials):
        # Generate train data        
        X = np.column_stack((np.ones(N), 
                             np.random.uniform(-1.,1., size=(N, dim))))
        Y = np.where((np.square(X[:,1]) + np.square(X[:,2]) - 0.6) >= 0,1,-1)
        points = random.sample(range(N), int(N*pnoise/100)) # indices of the outputs to be flipped (index 0 included)
        Y[points] = Y[points]*(-1) # add noise to the output
        
        # Compute the weights
        nw = np.dot(np.dot(np.linalg.inv(np.dot(X.T,X)),X.T),Y)
        
        # Estimate hypothesis
        g = np.where(np.dot(X,nw)>=0, 1, -1)
        
        # Compute in-sample error
        Error = g - Y
        Ein = np.count_nonzero(Error)/N
        nEin_t.append(Ein)
        
        # Compute out-of-sample error
        Xt = np.column_stack((np.ones(Nt),
                              np.random.uniform(-1.,1., size=(Nt, dim))))
        Yt = np.where((np.square(Xt[:,1]) + np.square(Xt[:,2]) - 0.6) >= 0,1,-1)
        points_t = random.sample(range(Nt), int(Nt*pnoise/100)) # fresh noise indices for the test set
        Yt[points_t] = Yt[points_t]*(-1)
            
        gt = np.where(np.dot(Xt,nw)>=0, 1, -1)
        Error_t = gt - Yt
        Eout = np.count_nonzero(Error_t)/Nt
        nEout_t.append(Eout)
        
    # Collect the results once all trials are done (nw holds the last trial's weights)
    nlr_out = {'nw': nw,
               'nEin_t': nEin_t,
               'nEout_t': nEout_t}
    
    return nlr_out

In [10]:
# INPUTS
N = 1000 # number of training examples
Nt = 1000 # number of test examples
dim = 2 # dimension of X
no_trials = 1000 # number of times to run the experiment

nlr_out = nonlinregr(N, Nt, dim, no_trials)

print('The value closest to the E_in is: ', 
      min([0, 0.1, 0.3, 0.5, 0.8], key=lambda x:abs(x-np.average(nlr_out['nEin_t']))),
     ' with E_in equal to ', np.average(nlr_out['nEin_t']))

The value closest to the E_in is:  0.5  with E_in equal to  0.505502


Answer: (d)

### Question 9  
Now, transform the $N = 1000$ training data into the following nonlinear feature vector:
$$(1, x_1, x_2, x_1x_2, x^2_1, x^2_2)$$
Find the vector $\widetilde{w}$ that corresponds to the solution of Linear Regression. Which
of the following hypotheses is closest to the one you find? Closest here means agrees the most with your hypothesis (has the highest probability of agreeing on a randomly selected point). Average over a few runs to make sure your answer is stable.

In [11]:
# Linear regression on the nonlinear feature vector (1, x1, x2, x1*x2, x1^2, x2^2)
def nonlinregr2(N, Nt, dim, no_trials):
    
    nEin_t = [] # save the in-sample error for each trial
    nEout_t = [] # save the out-of-sample error for each trial
    nwt = [] # store the weights of every trial
    pnoise = 10 # percent of noise
    
    for t in range(no_trials):
        # Generate train data        
        Xr = np.column_stack((np.ones(N), 
                             np.random.uniform(-1.,1., size=(N, dim))))
        X = np.column_stack((Xr, Xr[:,1]*Xr[:,2],
                            np.square(Xr[:,1]), np.square(Xr[:,2])))
        Y = np.where((np.square(X[:,1]) + np.square(X[:,2]) - 0.6) >= 0,1,-1)
        points = random.sample(range(N), int(N*pnoise/100)) # indices of the outputs to be flipped (index 0 included)
        Y[points] = Y[points]*(-1) # add noise to the output
        
        # Compute the weights
        nw = np.dot(np.dot(np.linalg.inv(np.dot(X.T,X)),X.T),Y)
        nwt.append(nw)
        
        # Estimate hypothesis
        g = np.where(np.dot(X,nw)>=0, 1, -1)
        
        # Compute in-sample error
        Error = g - Y
        Ein = np.count_nonzero(Error)/N
        nEin_t.append(Ein)
        
        # Compute out-of-sample error
        Xtr = np.column_stack((np.ones(Nt),
                              np.random.uniform(-1.,1., size=(Nt, dim))))
        Xt = np.column_stack((Xtr, Xtr[:,1]*Xtr[:,2],
                            np.square(Xtr[:,1]), np.square(Xtr[:,2])))
        Yt = np.where((np.square(Xt[:,1]) + np.square(Xt[:,2]) - 0.6) >= 0,1,-1)
        points_t = random.sample(range(Nt), int(Nt*pnoise/100)) # fresh noise indices for the test set
        Yt[points_t] = Yt[points_t]*(-1)
        
        gt = np.where(np.dot(Xt,nw)>=0, 1, -1)
        Error_t = gt - Yt
        Eout = np.count_nonzero(Error_t)/Nt
        nEout_t.append(Eout)
        
    # Collect the results once all trials are done
    nlr_out2 = {'nwt': nwt,
                'nEin_t': nEin_t,
                'nEout_t': nEout_t}
    
    return nlr_out2

In [12]:
# INPUTS
N = 1000 # number of training examples
Nt = 1000 # number of test examples
dim = 2 # dimension of X
no_trials = 1000 # number of times to run the experiment

nlr_out2 = nonlinregr2(N, Nt, dim, no_trials)
np.set_printoptions(precision=2)
# average the weight vectors over all trials
w_avg = np.average(np.array(nlr_out2['nwt']), axis = 0)
w_avg

array([ -9.94e-01,   1.63e-03,   6.32e-04,   4.46e-03,   1.56e+00,
         1.56e+00])

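Answer: (a). The averaged weights are roughly $-1$ on the constant term, $1.56$ on both $x_1^2$ and $x_2^2$, and close to zero on $x_1$, $x_2$ and $x_1x_2$.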
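
The question defines closeness as the probability of agreeing on a random point, which can be estimated directly instead of comparing weights by eye. The sketch below is an addition: `features` and `agreement` are hypothetical helpers, and since the answer choices are not reproduced here, it compares the averaged weights printed above with the noiseless target $f$, whose weights in the transformed space are $(-0.6, 0, 0, 0, 1, 1)$.

```python
import numpy as np

def features(x1, x2):
    """Nonlinear feature vector (1, x1, x2, x1*x2, x1^2, x2^2) for arrays of points."""
    return np.column_stack((np.ones_like(x1), x1, x2, x1 * x2, x1**2, x2**2))

def agreement(w_a, w_b, n_points=100_000, seed=0):
    """Monte Carlo estimate of P[sign(w_a . z) == sign(w_b . z)], x uniform on [-1, 1]^2."""
    rng = np.random.default_rng(seed)
    Z = features(rng.uniform(-1, 1, n_points), rng.uniform(-1, 1, n_points))
    return np.mean(np.sign(Z @ w_a) == np.sign(Z @ w_b))

w_avg = np.array([-9.94e-01, 1.63e-03, 6.32e-04, 4.46e-03, 1.56, 1.56])  # printed above
w_target = np.array([-0.6, 0.0, 0.0, 0.0, 1.0, 1.0])                     # noiseless f
print('agreement with the noiseless target:', agreement(w_avg, w_target))
```

Each candidate hypothesis from the answer choices can be passed as `w_b` in the same way, and the one with the highest agreement is the closest.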

### Question 10  
What is the closest value to the classification out-of-sample error $E_{out}$ of your hypothesis from Problem 9? (Estimate it by generating a new set of 1000 points and adding noise, as before. Average over 1000 runs to reduce the variation in your results.)

In [13]:
print('The value closest to the E_out is: ', 
      min([0, 0.1, 0.3, 0.5, 0.8], key=lambda x:abs(x-np.average(nlr_out2['nEout_t']))),
     ' with E_out equal to ', np.average(nlr_out2['nEout_t']))

The value closest to the E_out is:  0.1  with E_out equal to  0.126368


Answer: (b)
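
As a rough consistency check (an addition, assuming the label noise is independent of where the hypothesis disagrees with $f$), the formula from Question 3 with $\lambda=0.9$ relates this $E_{out}$ to the rate $\mu$ at which the hypothesis disagrees with the noiseless target:

$$\begin{align}
E_{out} &= (1-\mu)(1-\lambda)+\mu\lambda = 0.1 + 0.8\mu \\
\mu &= \frac{0.126368-0.1}{0.8}\approx 0.033
\end{align}
$$

Under that assumption the hypothesis disagrees with the noiseless target on only about 3% of points, and most of the measured error is the 10% noise itself.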