## loss_estimation.py

This file provides functions to calculate different types of loss estimators for a given model over a dataset. These include the naive loss, training/testing loss, and leave-one-out loss estimators. The functions can be used with any model that implements the `fit()` and `predict()` methods.

##### `naive_loss_estimation(model, X, y)`
Calculates the naive loss estimator for a given model over a given dataset.

##### Parameters:
- **model** (`object`): The model for which the naive loss will be calculated. The model must have `fit()` and `predict()` methods implemented.
- **X** (`numpy.ndarray`): A 2D array of shape `(n_samples, n_features)` representing the feature set of the dataset.
- **y** (`numpy.ndarray`): A 1D array of length `n_samples` representing the true response values corresponding to the features in `X`.

##### Returns:
- **naive_loss_estimate** (`float`): The naive loss estimate, calculated as the mean squared error (MSE) between the true and predicted values.
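
As an illustration of what the naive estimator measures, here is a minimal sketch, not the module's actual implementation. `LeastSquaresModel` and `naive_loss_sketch` are hypothetical names introduced for this example, and the sketch assumes that the estimator fits the model on the full data set before scoring it on that same data.

```python
import numpy as np


class LeastSquaresModel:
    """Hypothetical stand-in for any model exposing fit() and predict()."""

    def fit(self, X, y):
        # Ordinary least squares without an intercept.
        self.coef_, *_ = np.linalg.lstsq(X, y, rcond=None)
        return self

    def predict(self, X):
        return X @ self.coef_


def naive_loss_sketch(model, X, y):
    # Assumption: fit on all samples, then score on the same samples (in-sample MSE).
    model.fit(X, y)
    residuals = y - model.predict(X)
    return float(np.mean(residuals ** 2))


rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.standard_normal(200)
naive = naive_loss_sketch(LeastSquaresModel(), X, y)
print(naive)
```

Because the same samples are used for fitting and scoring, this estimate tends to be optimistic compared with the loss on unseen data.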



##### `train_test_loss_estimation(model, X, y, train_range, test_range)`
This function calculates the training/testing loss estimator for a given model over a dataset, using a training set and a test set. The model is trained on the training set and evaluated on the test set. The loss is calculated as the Mean Squared Error (MSE) between the predicted and actual responses on the test set.

##### Parameters:
- **model** (`object`): The model for which the training/testing loss will be calculated. The model must have `fit()` and `predict()` methods implemented.
- **X** (`numpy.ndarray`): A 2D array of shape `(n_samples, n_features)` representing the feature set of the dataset.
- **y** (`numpy.ndarray`): A 1D array of length `n_samples` representing the true response values corresponding to the features in `X`.
- **train_range** (`list`): The list of indices which will be used to train the model.
- **test_range** (`list`): The list of indices on which the MSE of the trained model will be calculated.

##### Returns:
- **train_test_loss_estimate** (`float`): The training/testing loss estimate, calculated as the mean squared error (MSE) between the true and predicted values on the test set, using the model trained on the training set.
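
The split described above can be sketched as follows. This is an illustration under stated assumptions rather than the module's code: `LeastSquaresModel` and `train_test_loss_sketch` are hypothetical names, and the index lists are assumed to be zero-based row indices into `X` and `y`.

```python
import numpy as np


class LeastSquaresModel:
    """Hypothetical stand-in for any model exposing fit() and predict()."""

    def fit(self, X, y):
        self.coef_, *_ = np.linalg.lstsq(X, y, rcond=None)
        return self

    def predict(self, X):
        return X @ self.coef_


def train_test_loss_sketch(model, X, y, train_range, test_range):
    # Fit on the training rows only, then score on the test rows only.
    model.fit(X[train_range], y[train_range])
    residuals = y[test_range] - model.predict(X[test_range])
    return float(np.mean(residuals ** 2))


rng = np.random.default_rng(0)
n = 200
X = rng.standard_normal((n, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.standard_normal(n)

full = list(range(n))
loss_full = train_test_loss_sketch(LeastSquaresModel(), X, y, full, full)
loss_half = train_test_loss_sketch(
    LeastSquaresModel(), X, y, list(range(0, n // 2)), list(range(n // 2, n))
)
print(loss_full, loss_half)
```

When both index lists cover every row, the result coincides with the in-sample MSE, which is the sense in which this estimator reduces to the naive one.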



##### `loo_loss_estimation(model, X, y)`
Calculates the leave-one-out (LOO) loss estimator for a given model over a given dataset. Assumes that the model is a linear model and uses the known closed-form solution for the LOO loss estimator of linear models for computational efficiency.

##### Parameters:
- **model** (`object`): The model for which the leave-one-out loss will be calculated. The model must have `fit()` and `predict()` methods implemented.
- **X** (`numpy.ndarray`): A 2D array of shape `(n_samples, n_features)` representing the feature set of the dataset.
- **y** (`numpy.ndarray`): A 1D array of length `n_samples` representing the true response values corresponding to the features in `X`.

##### Returns:
- **loo_loss_estimate** (`float`): The leave-one-out loss estimate, calculated using the closed form solution which is known for linear models.
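
For least squares, the standard closed form is $\frac{1}{n}\sum_i \left(\frac{y_i - x_i^\top\hat\beta}{1 - h_{ii}}\right)^2$, where $h_{ii}$ is the $i$-th diagonal entry of the hat matrix $X(X^\top X)^{-1}X^\top$. The sketch below compares that shortcut with an explicit refit for every left-out sample. The function names are hypothetical, and whether the module uses exactly this formula is an assumption. If the model includes an intercept, `X` here would need a leading column of ones.

```python
import numpy as np


def loo_loss_closed_form(X, y):
    # Leverages h_ii = x_i^T (X^T X)^{-1} x_i, computed without forming the full hat matrix.
    XtX_inv = np.linalg.inv(X.T @ X)
    leverage = np.einsum("ij,jk,ik->i", X, XtX_inv, X)
    beta_hat = XtX_inv @ X.T @ y
    residuals = y - X @ beta_hat
    # Each leave-one-out residual equals the full-fit residual divided by (1 - h_ii).
    return float(np.mean((residuals / (1.0 - leverage)) ** 2))


def loo_loss_brute_force(X, y):
    # Reference implementation: refit n times, each time leaving one sample out.
    n = X.shape[0]
    errors = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        beta_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        errors[i] = y[i] - X[i] @ beta_i
    return float(np.mean(errors ** 2))


rng = np.random.default_rng(0)
X = rng.standard_normal((60, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.standard_normal(60)
closed = loo_loss_closed_form(X, y)
brute = loo_loss_brute_force(X, y)
print(closed, brute)
```

The closed form needs a single fit instead of `n` refits, which is where the computational saving comes from.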


### Example
Below is a code snippet in which the three loss estimators are applied to an OLS estimator on a generated data set. The first and second outputs showcase that, when the entire data set is used for both training and testing, the training/testing loss estimator reduces to the naive loss estimator.

In [15]:
import numpy as np
import sys
sys.path.append('..')
from stats_module import *

#### Generate data:
- For $n=1000$ samples and $p = 10$ covariates, generate design matrix X as standard normal data.
- Generate $y = X\beta + e$, where $\beta$ is given and $e$ is from a standard normal distribution.
....

In [16]:
np.random.seed(0)
n = 1000
p = 10
X = np.random.randn(n,p)
beta = np.arange(1,p+1)
e = np.random.randn(n)
y = X @  beta + e

model = OLS(include_intercept=True)
model.fit(X, y)


In [None]:
print("Naive:\t\t\t" +              str(naive_loss_estimation(model,X,y)))
print("Train-Test (full/full):\t" + str(train_test_loss_estimation(model, X, y, list(range(0,1000)), list(range(0,1000)) )))
print("Train-Test (half/half):\t" + str(train_test_loss_estimation(model, X, y, list(range(0,500)), list(range(500,1000)) )))
print("Leave-one-out:\t\t" +        str(loo_loss_estimation(model, X, y)))

## LinearModelTester.py

This file holds a class for performing hypothesis tests and building confidence intervals on a fitted Gaussian homoscedastic linear model.

### Initialization:
#### LinearModelTester(model)
##### Parameters:
- **model** (`object`): A fitted linear model object with:
    - **$\beta$** (`numpy.ndarray`): Estimated coefficients of the model.
    - **include_intercept** (`bool`): Whether the model includes an intercept term.
##### Raises:
- ValueError: If the model is not fitted (**$\beta$** is None).


### Methods:
#### `hypothesis_t_test(X, y, null_hypothesis, alpha=0.05)`:
Perform a t-test for individual coefficients.
##### Parameters:
- **X** (`numpy.ndarray`): Feature matrix $(n \times p)$.
- **y** (`numpy.ndarray`): Response vector $(n \times 1)$.
- **null_hypothesis** (`numpy.ndarray`): Hypothesized values of the coefficients, one per coefficient.
- **alpha** (`float`): Significance level $\alpha$ (default 0.05).
##### Returns:
- List of dictionaries with:
    - **coefficient** (`int`): Index of the coefficient.
    - **beta_estimate** (`numpy.ndarray`): Estimated value.
    - **null_value** (`float`): Null hypothesis value for the coefficient.
    - **t_stat** (`float`): T-statistic.
    - **p_value** (`float`): P-value of the t-statistic.
    - **reject_null** (`bool`): Whether the null hypothesis is rejected at significance level $\alpha$.
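
To make the returned fields concrete, here is a minimal sketch of per-coefficient t-tests under the Gaussian homoscedastic model. It is not the class's implementation: `t_test_sketch` is a hypothetical function, the test is assumed to be two-sided, and `X` is assumed to already contain an intercept column if one is wanted. It relies on `scipy.stats.t` for the t distribution.

```python
import numpy as np
from scipy import stats


def t_test_sketch(X, y, null_hypothesis, alpha=0.05):
    """Two-sided t-tests of H0: beta_j = null_hypothesis[j], one per coefficient."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    residuals = y - X @ beta_hat
    sigma2_hat = residuals @ residuals / (n - p)  # unbiased noise variance estimate
    std_err = np.sqrt(sigma2_hat * np.diag(XtX_inv))
    t_stats = (beta_hat - null_hypothesis) / std_err
    # Two-sided p-value from the t distribution with n - p degrees of freedom.
    p_values = 2.0 * stats.t.sf(np.abs(t_stats), df=n - p)
    return [
        {
            "coefficient": j,
            "beta_estimate": float(beta_hat[j]),
            "null_value": float(null_hypothesis[j]),
            "t_stat": float(t_stats[j]),
            "p_value": float(p_values[j]),
            "reject_null": bool(p_values[j] < alpha),
        }
        for j in range(p)
    ]


rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.standard_normal(500)
results = t_test_sketch(X, y, null_hypothesis=np.array([0.0, 2.0, 3.0]))
for res in results:
    print(res)
```

In this setup the first null value (0) is far from the true coefficient (1), so that hypothesis is rejected; the other two null values are the true coefficients.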

#### `hypothesis_F_test(X, y, R, r, alpha=0.05)`:
Perform an F-test for hypotheses of the form $R\beta = r$.
##### Parameters:
- **X** (`numpy.ndarray`): Feature matrix $(n \times p)$.
- **y** (`numpy.ndarray`): Response vector $(n \times 1)$.
- **R** (`numpy.ndarray`): Constraint matrix $(k \times p)$.
- **r** (`numpy.ndarray`): Constraint vector $(k \times 1)$.
- **alpha** (`float`): Significance level $\alpha$ (default 0.05).
##### Returns:
- Dictionary with:
    - **F_stat** (`float`): F-statistic.
    - **p_value** (`float`): P-value of the F-statistic.
    - **reject_null** (`bool`): Whether the null hypothesis is rejected.
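
For reference, the textbook F-statistic for $H_0: R\beta = r$ in the Gaussian homoscedastic linear model is shown below. That the class computes exactly this expression is an assumption. Here $p$ counts the columns of the design matrix (including the intercept column, if any) and $k$ is the number of rows of $R$.

$$
F = \frac{(R\hat\beta - r)^\top \left[ R (X^\top X)^{-1} R^\top \right]^{-1} (R\hat\beta - r) / k}{\hat\sigma^2},
\qquad
\hat\sigma^2 = \frac{\lVert y - X\hat\beta \rVert^2}{n - p}
$$

Under $H_0$, $F$ follows an $F_{k,\,n-p}$ distribution, so the null hypothesis is rejected when the p-value $P(F_{k,\,n-p} \ge F)$ falls below $\alpha$.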

#### `confidence_interval(X, y, alpha=0.05)`:
Construct confidence intervals for model coefficients.
##### Parameters:
- **X** (`numpy.ndarray`): Feature matrix $(n \times p)$.
- **y** (`numpy.ndarray`): Response vector $(n \times 1)$.
- **alpha** (`float`): Significance level $\alpha$ (default 0.05).
##### Returns:
- List of dictionaries with:
    - **coefficient** (`int`): Index of the coefficient.
    - **beta_estimate** (`float`): Estimated value of the coefficient.
    - **confidence_lower** (`float`): Lower bound of the $1-\alpha$ confidence interval.
    - **confidence_upper** (`float`): Upper bound of the $1-\alpha$ confidence interval.
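
As a reference point, the standard $1-\alpha$ confidence interval for a single coefficient $\beta_j$ in this model is given below. This is the textbook form and is assumed, not confirmed, to be what the method returns. $\hat\sigma^2 = \lVert y - X\hat\beta \rVert^2 / (n - p)$ is the unbiased noise variance estimate and $t_{n-p,\,1-\alpha/2}$ is the corresponding quantile of the t distribution with $n-p$ degrees of freedom.

$$
\hat\beta_j \;\pm\; t_{n-p,\,1-\alpha/2}\;\hat\sigma\,\sqrt{\left[(X^\top X)^{-1}\right]_{jj}}
$$

The interval contains the null value $\beta_j^0$ exactly when the two-sided t-test of $H_0: \beta_j = \beta_j^0$ at level $\alpha$ does not reject.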

#### `prediction_interval_m(X, y, x_new, alpha=0.05)`:
Construct a confidence interval for $m(x_{new}) = x_{new}^\top\beta$ at a new point ($x_{new}$).
##### Parameters:
- **X** (`numpy.ndarray`): Feature matrix $(n \times p)$.
- **y** (`numpy.ndarray`): Response vector $(n \times 1)$.
- **x_new** (`numpy.ndarray`): New feature vector $(1 \times p)$.
- **alpha** (`float`): Significance level $\alpha$ (default 0.05).
##### Returns:
- Dictionary with:
    - **mx_new_estimate** (`np.ndarray`): Estimated $m(x_{new})$.
    - **confidence_lower** (`float`): Lower bound of the $1-\alpha$ confidence interval.
    - **confidence_upper** (`float`): Upper bound of the $1-\alpha$ confidence interval.

#### `prediction_interval_y(X, y, x_new, alpha=0.05)`:
Construct a prediction interval for a new observation, $y_{new}$.
##### Parameters:
- **X** (`numpy.ndarray`): Feature matrix $(n \times p)$.
- **y** (`numpy.ndarray`): Response vector $(n \times 1)$.
- **x_new** (`numpy.ndarray`): New feature vector $(1 \times p)$.
- **alpha** (`float`): Significance level $\alpha$ (default 0.05).
##### Returns:
- Dictionary with:
    - **mx_new_estimate** (`np.ndarray`): Estimated $m(x_{new})$.
    - **confidence_lower** (`float`): Lower bound of the $1-\alpha$ prediction interval for $y_{new}$.
    - **confidence_upper** (`float`): Upper bound of the $1-\alpha$ prediction interval for $y_{new}$.
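
The difference between the two interval methods is easiest to see from the textbook formulas, which are assumed here to match the implementation. With $\hat m(x_{new}) = x_{new}^\top\hat\beta$ and $\hat\sigma^2 = \lVert y - X\hat\beta \rVert^2 / (n - p)$:

$$
m(x_{new}):\quad \hat m(x_{new}) \;\pm\; t_{n-p,\,1-\alpha/2}\;\hat\sigma\,\sqrt{x_{new}^\top (X^\top X)^{-1} x_{new}}
$$

$$
y_{new}:\quad \hat m(x_{new}) \;\pm\; t_{n-p,\,1-\alpha/2}\;\hat\sigma\,\sqrt{1 + x_{new}^\top (X^\top X)^{-1} x_{new}}
$$

Both intervals are centered at the same point estimate. The interval for $y_{new}$ is always wider, because the extra $1$ under the square root accounts for the noise in the new observation itself, on top of the uncertainty in $\hat\beta$.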

### Example:

In [9]:
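#generate a random dataset
# (np, OLS and LinearModelTester are assumed to be imported as in the earlier example)
np.random.seed(0)
n = 1000
p = 3  # three covariates plus the intercept match the four-entry H0 and R below
X = np.random.randn(n,p)
beta = np.arange(1,p+1)
e = np.random.randn(n)
y = X @ beta + e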

#fit an OLS estimator
model = OLS(include_intercept=True)
model.fit(X, y)

#generate new point to predict
x_new = np.random.randn(1, p)

summary = model.summary(X,y)
print(summary)


{'coefficients': array([-0.41221599,  0.93623639,  2.38843176,  2.99360915]), 'r_squared': np.float64(0.9786838149519005)}


In [10]:
#hypothesis testing on the coefficients

tester = LinearModelTester(model)
H0 = np.array([0, 1, 2, 3])
alpha = 0.05

# test H0 for each coefficient

results = tester.hypothesis_t_test(X, y, H0, alpha)
for result in results:
    print(f"Coefficient {result['coefficient']}:")
    print(f"  Estimated: {result['beta_estimate']}")
    print(f"  Null value: {result['null_value']}")
    print(f"  t-stat: {result['t_stat']}")
    print(f"  p-value: {result['p_value']}")
    print(f"  Reject null: {result['reject_null']}")

# build confidence intervals for coefficients
results = tester.confidence_interval(X, y, alpha)
for result in results:
    print('--'*40)
    print(f"Coefficient {result['coefficient']}:")
    print(f"  Estimated: {result['beta_estimate']}")
    print(f"  Confidence interval: [{result['confidence_lower']}, {result['confidence_upper']}]")

# hypothesis testing on linear combinations of coefficients
R = np.array([
    [0, 0, 1, 0],
    [0, 1, 0, 1]
])
r = np.array([0, 0])

# H0: R @ beta = r

results = tester.hypothesis_F_test(X, y, R, r, alpha)
print('--'*40)
print(f"F-stat: {results['F_stat']}")
print(f"p-value: {results['p_value']}")
print(f"Reject null: {results['reject_null']}")



Coefficient 0:
  Estimated: -0.41221598820940447
  Null value: 0
  t-stat: -2.440361254246725
  p-value: 0.026682263977906073
  Reject null: True
Coefficient 1:
  Estimated: 0.9362363940133275
  Null value: 1
  t-stat: -0.46340266445100076
  p-value: 0.6493169758975523
  Reject null: False
Coefficient 2:
  Estimated: 2.388431760388897
  Null value: 2
  t-stat: 2.4816537047952063
  p-value: 0.024563519243711474
  Reject null: True
Coefficient 3:
  Estimated: 2.993609153939517
  Null value: 3
  t-stat: -0.03907216395463594
  p-value: 0.9693162326966738
  Reject null: False
--------------------------------------------------------------------------------
Coefficient 0:
  Estimated: -0.41221598820940447
  Confidence interval: [-0.7703018479599026, -0.0541301284589063]
--------------------------------------------------------------------------------
Coefficient 1:
  Estimated: 0.9362363940133275
  Confidence interval: [0.6445401725668481, 1.2279326154598067]
----------------------------------

In [13]:
alpha = 0.05
result = tester.prediction_interval_m(X, y, x_new, alpha)
print('--'*40)
print(f"Prediction interval for new point m{x_new}:")
print(f"  Estimated m(x_new): {result['mx_new_estimate']}")
print(f"  Confidence interval: [{result['confidence_lower']}, {result['confidence_upper']}]")

result = tester.prediction_interval_y(X, y, x_new, alpha)
print('--'*40)
print(f"Prediction interval for response of new point, y_new:")
print(f"  Confidence interval: [{result['confidence_lower']}, {result['confidence_upper']}]")


--------------------------------------------------------------------------------
Prediction interval for new point m[[-1.16514984  0.90082649  0.46566244]]:
  Estimated m(x_new): 2.0425022606342313
  Confidence interval: [1.4122759816245984, 2.672728539643864]
--------------------------------------------------------------------------------
Prediction interval for response of new point, y_new:
  Confidence interval: [0.44011628066077035, 3.6448882406076923]
