# Exercise 1


Sometimes we need to regress on count data, that is, $Y_i$ is the count of something and takes values $0,1,\ldots$. A reasonable distribution for count data is the Poisson distribution, so we could consider
$$
    Y_i \mid X_i \sim \text{Poisson}(\lambda(X_i)), \text{ where $\lambda(X_i) = G(\beta_0 + \beta_1 X_i)$}
$$
where $G(x) = e^x$. There are two reasons for choosing $G(x) = e^x$. First, it is positive for every $x$, which matches the parameter space of the Poisson distribution. Second, in practice it tends to be a better model for count data. To see why, think of $X_i$ as denoting the presence or absence of something. In its absence the rate parameter is $\lambda(0) = e^{\beta_0}$, and in its presence ($X_i = 1$) it becomes $\lambda(1) = e^{\beta_0 + \beta_1} = e^{\beta_0}e^{\beta_1} = \lambda(0)e^{\beta_1}$. The effect is thus multiplicative: the presence of $X_i$ changes the rate by the constant factor $e^{\beta_1}$ (this is called a proportional model).

Recall that a random variable $X \sim \text{Poisson}(\lambda)$ if its probability mass function is:

$$
f(x; \lambda) = \exp{(-\lambda)} \frac{\lambda^x}{x!}, \quad \lambda > 0, \quad x \in \{0,1,2,\ldots\}
$$

The assignment for you now is to go through the same steps as above, i.e. derive the conditional likelihood and apply it to a problem by filling in the missing parts of the code below.

Hint: derive the log-likelihood on paper first, then drop the factorial term, which does not depend on the parameters.
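
For reference, the hint can be carried out as follows. This is a sketch that assumes the pairs $(X_i, Y_i)$, $i = 1,\ldots,n$, are independent; the symbol $\ell$ for the conditional log-likelihood is introduced here and is not part of the assignment text. Plugging $\lambda(X_i) = e^{\beta_0 + \beta_1 X_i}$ into the probability mass function and taking logarithms gives

$$
\ell(\beta_0, \beta_1) = \sum_{i=1}^n \log f(Y_i; \lambda(X_i)) = \sum_{i=1}^n \left[ -e^{\beta_0 + \beta_1 X_i} + Y_i (\beta_0 + \beta_1 X_i) - \log(Y_i!) \right]
$$

The term $\log(Y_i!)$ does not involve $\beta_0$ or $\beta_1$, so maximising $\ell$ is equivalent to minimising

$$
\sum_{i=1}^n \left[ e^{\beta_0 + \beta_1 X_i} - Y_i (\beta_0 + \beta_1 X_i) \right]
$$

which is the negative log-likelihood without the factorial part.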

In [1]:
import numpy as np
from scipy import optimize

# do not change next two lines - this is the X,Y data
X_samples= np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19])
Y_samples= np.array([16, 14, 16, 11, 16, 14, 9, 13, 13, 6, 9, 12, 6, 7, 5, 3, 4, 4, 2, 5])

# finding MLE for Poisson Regression without the factorial part
# do not Change the name of the next function - just replace XXX
def negLogLklOPoissonRegression_wo_factorial(X,Y,params):
    '''Calculate the negative log-likelihood for Poisson regression without the factorial part'''
    beta0 = params[0]
    beta1 = params[1]
    
    # Calculate lambda(X_i) for each X_i
    lambda_X = np.exp(beta0 + beta1 * X)
    
    # Calculate the negative log-likelihood
    nll = np.sum(lambda_X - Y * (beta0 + beta1 * X))
    
    return nll

# you should only change XXX below and not anything else
parameter_bounding_box=((-5.0, 5.0), (-1.0, 1.0)) # specify the constraints for each parameter - some guess work.
initial_arguments = np.array([0.0, 0.0]) # point in 2D to initialise the minimize algorithm

# Create a function that can be sent into the optimizer
negLogLklOPoissonRegression_wo_factorial_XY = lambda params: negLogLklOPoissonRegression_wo_factorial(X_samples,Y_samples,params)

# Optimize using scipy's minimize function
result_Assignment4Problem1 = optimize.minimize(negLogLklOPoissonRegression_wo_factorial_XY, 
                                              initial_arguments, bounds=parameter_bounding_box)

# Print the result
print(result_Assignment4Problem1)


  message: CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL
  success: True
   status: 0
      fun: -245.55959584262357
        x: [ 2.888e+00 -8.115e-02]
      nit: 9
      jac: [-2.842e-06 -8.527e-06]
     nfev: 45
     njev: 15
 hess_inv: <2x2 LbfgsInvHessProduct with dtype=float64>
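
One way to sanity-check and interpret a fit like this is sketched below. It refits the same model in a self-contained way (the names `neg_log_lik`, `rate_ratio` and `fitted_means` are illustrative, not part of the assignment) and uses two facts: $e^{\beta_1}$ is the multiplicative change in the rate per unit increase of $X$, and at the maximum likelihood estimate the score equation for $\beta_0$ forces the fitted means to sum to the observed total count.

```python
import numpy as np
from scipy import optimize

# Same data as in the exercise
X = np.arange(20)
Y = np.array([16, 14, 16, 11, 16, 14, 9, 13, 13, 6, 9, 12, 6, 7, 5, 3, 4, 4, 2, 5])

def neg_log_lik(params):
    """Negative Poisson log-likelihood without the factorial term."""
    eta = params[0] + params[1] * X  # linear predictor beta0 + beta1 * X_i
    return np.sum(np.exp(eta) - Y * eta)

# Same starting point and bounding box as in the exercise cell
fit = optimize.minimize(neg_log_lik, np.array([0.0, 0.0]),
                        bounds=((-5.0, 5.0), (-1.0, 1.0)))
beta0_hat, beta1_hat = fit.x

# Multiplicative effect on the rate of increasing X by one unit
rate_ratio = np.exp(beta1_hat)

# Fitted Poisson means; their sum should match the total observed count
fitted_means = np.exp(beta0_hat + beta1_hat * X)

print(f"rate ratio per unit of X: {rate_ratio:.3f}")
print(f"sum of fitted means: {fitted_means.sum():.2f}, sum of counts: {Y.sum()}")
```

With the estimates reported above, the rate ratio is about 0.92, i.e. the expected count drops by roughly 8% for each unit increase in $X$.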


# Exercise 2


Consider an instance space $X$ consisting of the integers $1$ to $1000$ and a target concept $c^\ast = \{x: 501 \leq x \leq 1000\}$. Suppose your hypothesis class $\mathcal H$ is $\{h_j:\, h_j=\{x: j \leq x \leq 1000\}, j=1,\ldots,1000\}$. How large must the training set $S$ be to ensure that, with probability $99\%$, any consistent hypothesis (training error 0) has a true error less than $10\%$? (Hint: use the theorem above.)

In [5]:
import numpy as np

# Finite-class PAC bound (assumed to be the theorem the hint refers to):
# with N >= (1/epsilon) * (ln|H| + ln(1/delta)) samples, with probability
# at least 1 - delta every consistent hypothesis has true error below epsilon.
H_size = 1000    # one hypothesis h_j for each j = 1, ..., 1000
epsilon = 0.1    # allowed true error
delta = 0.01     # allowed failure probability (1 - 0.99)

# Smallest integer N that satisfies the bound
N2 = int(np.ceil((np.log(H_size) + np.log(1 / delta)) / epsilon))
print(f"Required number of training samples: {N2}")


Required number of training samples: 116
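
The finite-class bound $N \geq \frac{1}{\epsilon}\left(\ln|\mathcal H| + \ln\frac{1}{\delta}\right)$, assumed here to be the theorem the hint refers to, can be checked empirically. The sketch below makes one extra assumption that is not in the exercise: the training points are drawn uniformly from $1,\ldots,1000$. Under that assumption the true error of $h_j$ is $|j - 501| / 1000$, and the consistent hypotheses are exactly those with $j$ above the largest negative sample and at most the smallest positive sample. The helper `worst_consistent_error` is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
epsilon, delta, H_size = 0.1, 0.01, 1000

# Sample size given by the finite-class bound
N = int(np.ceil((np.log(H_size) + np.log(1 / delta)) / epsilon))

def worst_consistent_error(sample):
    """Largest true error among hypotheses h_j consistent with the sample,
    assuming a uniform distribution on 1..1000 and target {501, ..., 1000}."""
    neg = sample[sample <= 500]   # points labelled -1 by the target
    pos = sample[sample >= 501]   # points labelled +1 by the target
    max_neg = neg.max() if neg.size else 0     # consistent j must exceed this
    min_pos = pos.min() if pos.size else 1000  # consistent j is at most this
    # error of h_j is |j - 501| / 1000; it is largest at the ends of the range
    return max(500 - max_neg, min_pos - 501) / 1000

trials = 2000
errors = np.array([
    worst_consistent_error(rng.integers(1, 1001, size=N)) for _ in range(trials)
])
failure_rate = np.mean(errors >= epsilon)

print(f"N = {N}, estimated failure probability = {failure_rate:.4f}")
```

The bound holds for every distribution, so for this particular one the estimated failure probability should come out far below $\delta = 0.01$.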


# Exercise 3 


The following datasets are subsets of $d$-dimensional '0/1' vectors with label +1. The remaining '0/1' vectors are '-1'. Determine if the following three problems are linearly separable:

1. $c^\ast = \{(0,1,0),(0,1,1),(1,0,0),(1,1,1)\}$, and $X$ is all $0/1$ vectors in 3 dimensions.
2. $c^\ast = \{(0,1,1),(0,1,0),(1,1,0),(1,1,1)\}$, and $X$ is all $0/1$ vectors in 3 dimensions.
3. $$\begin{align*}
      c^\ast = \{(0,1,0,0),(0,1,0,1),(0,1,1,0),(1,0,0,0)\\
      ,(1,1,0,0),(1,1,0,1),(1,1,1,0),(1,1,1,1)\}
    \end{align*}$$
    and $X$ is all $0/1$ vectors in 4 dimensions.

In [8]:
# Replace with True if true and False if false, they represent the three problems
Solution_Q1 = False
Solution_Q2 = True
Solution_Q3 = True

A dataset is **linearly separable** if there exists a hyperplane $w \cdot x = t$ such that $w \cdot x > t$ for every positive example (label +1) and $w \cdot x < t$ for every negative example (label -1).

### Problem 1
- **Dataset**: $c^\ast = \{(0,1,0), (0,1,1), (1,0,0), (1,1,1)\}$
- **Instance space** $X$: all $0/1$ vectors in 3 dimensions ($2^3 = 8$ vectors in total).
- The negative examples are the remaining $4$ vectors: $(0,0,0), (0,0,1), (1,0,1), (1,1,0)$.

This dataset is not linearly separable. The positives $(1,0,0)$ and $(0,1,0)$ and the negatives $(0,0,0)$ and $(1,1,0)$ satisfy $(1,0,0) + (0,1,0) = (0,0,0) + (1,1,0)$. A separating hyperplane would require $w \cdot (1,0,0) + w \cdot (0,1,0) > 2t$ and $w \cdot (0,0,0) + w \cdot (1,1,0) < 2t$, but both left-hand sides equal $w_1 + w_2$, which is a contradiction.

### Problem 2
- **Dataset**: $c^\ast = \{(0,1,1), (0,1,0), (1,1,0), (1,1,1)\}$
- **Instance space** $X$: all $0/1$ vectors in 3 dimensions ($2^3 = 8$ vectors in total).
- The negative examples are: $(0,0,0), (0,0,1), (1,0,0), (1,0,1)$.

This dataset is linearly separable. The positive examples are exactly the vectors with second coordinate $x_2 = 1$ and the negative examples are exactly those with $x_2 = 0$, so the hyperplane $x_2 = 1/2$ separates them.

### Problem 3
- **Dataset**: $c^\ast = \{(0,1,0,0), (0,1,0,1), (0,1,1,0), (1,0,0,0), (1,1,0,0), (1,1,0,1), (1,1,1,0), (1,1,1,1)\}$
- **Instance space** $X$: all $0/1$ vectors in 4 dimensions ($2^4 = 16$ vectors in total).
- The negative examples are the remaining $8$ vectors: $(0,0,0,0), (0,0,0,1), (0,0,1,0), (0,0,1,1), (0,1,1,1), (1,0,0,1), (1,0,1,0), (1,0,1,1)$.

This dataset is linearly separable. Take $w = (1, 2, -1, -1)$ and $t = 1/2$. Every positive example gives $w \cdot x \in \{1, 2, 3\}$, and every negative example gives $w \cdot x \in \{-2, -1, 0\}$, so the hyperplane $x_1 + 2x_2 - x_3 - x_4 = 1/2$ separates the two classes.

### Conclusion
- **Problem 1** is **not** linearly separable: **False**
- **Problem 2** is linearly separable: **True**
- **Problem 3** is linearly separable: **True**
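
Linear separability of a finite labelled set can also be checked mechanically as a linear programming feasibility problem. The sketch below is one possible formulation: it looks for $(w, b)$ with $y_i (w \cdot x_i + b) \geq 1$ for all points, which is feasible exactly when the finite set is strictly separable. The function name `is_linearly_separable` is illustrative; `scipy.optimize.linprog` is used only as a feasibility solver with a zero objective.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def is_linearly_separable(positives, dim):
    """Return True if the given 0/1 vectors (label +1) can be separated by a
    hyperplane from all other 0/1 vectors of the same dimension (label -1)."""
    points = list(itertools.product([0, 1], repeat=dim))
    positive_set = set(positives)
    labels = np.array([1.0 if p in positive_set else -1.0 for p in points])
    # Unknowns z = (w_1, ..., w_dim, b); constraint y_i * (w . x_i + b) >= 1
    # is rewritten as -y_i * [x_i, 1] . z <= -1 for linprog's A_ub z <= b_ub.
    features = np.hstack([np.array(points, dtype=float), np.ones((len(points), 1))])
    A_ub = -labels[:, None] * features
    b_ub = -np.ones(len(points))
    result = linprog(c=np.zeros(dim + 1), A_ub=A_ub, b_ub=b_ub,
                     bounds=[(None, None)] * (dim + 1))
    return bool(result.success)  # feasible <=> separable

problem1 = [(0, 1, 0), (0, 1, 1), (1, 0, 0), (1, 1, 1)]
problem2 = [(0, 1, 1), (0, 1, 0), (1, 1, 0), (1, 1, 1)]
problem3 = [(0, 1, 0, 0), (0, 1, 0, 1), (0, 1, 1, 0), (1, 0, 0, 0),
            (1, 1, 0, 0), (1, 1, 0, 1), (1, 1, 1, 0), (1, 1, 1, 1)]

results = [is_linearly_separable(problem1, 3),
           is_linearly_separable(problem2, 3),
           is_linearly_separable(problem3, 4)]
print(results)
```

The margin of 1 in the constraints is arbitrary: any strictly separating hyperplane for a finite set can be rescaled to satisfy it.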


---
#### Local Test for Assignment 4, PROBLEM 3
Evaluate cell below to make sure your answer is valid. You **should not** modify anything in the cell below when evaluating it to do a local test of your solution.
You may need to include and evaluate code snippets from lecture notebooks in cells above to make the local test work correctly sometimes (see error messages for clues). This is meant to help you become efficient at recalling materials covered in lectures that relate to this problem. Such local tests will generally not be available in the exam.

# Exercise 4


In Problem 2 we looked at the following problem:
Consider an instance space $X$ consisting of the integers $1$ to $1000$ and a target concept $c^\ast = \{x: 501 \leq x \leq 1000\}$. Suppose your hypothesis class $\mathcal H$ is $\{h_j:\, h_j=\{x: j \leq x \leq 1000\}, j=1,\ldots,1000\}$.

What is the VC-dimension of our hypothesis class? That is, what is the maximum number of points such that $\mathcal H$ shatters that set?

In [10]:
# H consists of thresholds h_j = {x : j <= x <= 1000}.
# A single point can be shattered: x = 1 lies in h_1 but not in h_2.
# No pair a < b can be shattered: every h_j that contains a also contains b,
# so the labelling (a = +1, b = -1) is never realised.
VC_dimension = 1

print(f"The VC-dimension of the hypothesis class is: {VC_dimension}")

The VC-dimension of the hypothesis class is: 1
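
The VC-dimension of a threshold class can be confirmed by brute force on a scaled-down instance space. The sketch below uses $n = 40$ instead of $1000$ purely to keep the enumeration fast, which is an assumption made here; the structure of the class is the same. The helper `vc_dimension` is illustrative. It relies on the fact that every subset of a shattered set is shattered, so the search can stop at the first size for which no set is shattered.

```python
from itertools import combinations

def vc_dimension(instance_space, hypotheses, max_size=3):
    """Largest k <= max_size such that some k-point subset is shattered."""
    vc = 0
    for k in range(1, max_size + 1):
        shattered_some = False
        for subset in combinations(instance_space, k):
            # All labellings of the subset that the class can produce
            patterns = {tuple(x in h for x in subset) for h in hypotheses}
            if len(patterns) == 2 ** k:
                shattered_some = True
                break
        if not shattered_some:
            break  # no k-set is shattered, so no larger set is either
        vc = k
    return vc

n = 40  # scaled-down stand-in for 1000
space = range(1, n + 1)
thresholds = [range(j, n + 1) for j in range(1, n + 1)]  # h_j = {j, ..., n}

vc = vc_dimension(space, thresholds)
print(f"VC-dimension of the threshold class on 1..{n}: {vc}")
```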


# Exercise 5


Revisit Problem 2, this time applying the VC-bound with the VC-dimension found in Problem 4.
Consider an instance space $X$ consisting of the integers $1$ to $1000$ and a target concept $c^\ast = \{x: 501 \leq x \leq 1000\}$. Suppose your hypothesis class $\mathcal H$ is $\{h_j:\, h_j=\{x: j \leq x \leq 1000\}, j=1,\ldots,1000\}$. How large must the training set $S$ be to ensure that, with probability $99\%$, any consistent hypothesis (training error 0) has a true error less than $10\%$?

In [16]:
import numpy as np
from scipy.optimize import brentq

# Define parameters
VC_dimension = 1  # threshold class: no pair a < b can be labelled (a = +1, b = -1)
epsilon = 0.1
delta = 0.01

# VC-bound, assumed in the form N >= (1/epsilon) * (d * ln(2N/d) + ln(1/delta)).
# The smallest admissible N is the root of the equation below on [1, 10000].
def vc_bound_equation(N):
    term1 = (VC_dimension * np.log(2 * N / VC_dimension)) + np.log(1 / delta)
    return (1 / epsilon) * term1 - N

# Solve for N on a bracket where the equation changes sign (positive at 1, negative at 10000)
N5_solution = brentq(vc_bound_equation, 1, 10000)
N5 = int(np.ceil(N5_solution))  # Ensure the result is an integer

print(f"Required training set size (N5): {N5}")


Required training set size (N5): 99
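
An implicit bound of this kind can also be handled without a root finder, by scanning for the smallest integer that satisfies the inequality. The sketch below assumes the VC-bound has the form $N \geq \frac{1}{\epsilon}\left(d \ln\frac{2N}{d} + \ln\frac{1}{\delta}\right)$ and that the threshold class has VC-dimension $d = 1$ (no pair $a < b$ can be labelled $a = +1$, $b = -1$). The helper names `vc_bound_rhs` and `smallest_n` are illustrative.

```python
import numpy as np

def vc_bound_rhs(n, d, epsilon, delta):
    """Right-hand side of the assumed VC-bound for sample size n."""
    return (d * np.log(2 * n / d) + np.log(1 / delta)) / epsilon

def smallest_n(d, epsilon, delta, n_max=10**6):
    """Smallest integer n with n >= vc_bound_rhs(n, ...), found by linear scan."""
    for n in range(1, n_max + 1):
        if n >= vc_bound_rhs(n, d, epsilon, delta):
            return n
    raise ValueError("no admissible n found up to n_max")

n_required = smallest_n(d=1, epsilon=0.1, delta=0.01)
print(f"Smallest n satisfying the assumed VC-bound: {n_required}")
```

An integer scan avoids the failure mode of derivative-based solvers that step to non-positive $N$, where the logarithm is undefined.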
