# Code

In [1]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

In [2]:
cancer = load_breast_cancer() # returns a Bunch (dictionary-like object), not an ndarray

In [3]:
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)

In [4]:
print(X_train.shape)
print(X_test.shape)

(426, 30)
(143, 30)


## Preprocessing Data

- Makes the data compatible with the models
- `MinMaxScaler` - Transform features by scaling each feature to a given range
  - This estimator scales and translates each feature individually such that it is in the given range on the training set, e.g. between zero and one.


In [5]:
scaler: MinMaxScaler = MinMaxScaler()

- `fit()` computes the minimum and maximum value of each feature of the training set

In [6]:
scaler.fit(X_train)

- The `transform()` method applies the transformation

In [7]:
X_train_scaled: np.ndarray = scaler.transform(X_train) # transform training data

In [8]:
print(X_train_scaled.shape)
print(X_train.min(axis=0)) # min value per feature before scaling
print(X_train.max(axis=0)) # max value per feature before scaling
print(X_train_scaled.min(axis=0)) # min value per feature after scaling
print(X_train_scaled.max(axis=0)) # max value per feature after scaling

(426, 30)
[6.981e+00 9.710e+00 4.379e+01 1.435e+02 5.263e-02 1.938e-02 0.000e+00
 0.000e+00 1.060e-01 4.996e-02 1.115e-01 3.628e-01 7.570e-01 7.228e+00
 1.713e-03 2.252e-03 0.000e+00 0.000e+00 7.882e-03 8.948e-04 7.930e+00
 1.202e+01 5.041e+01 1.852e+02 7.117e-02 2.729e-02 0.000e+00 0.000e+00
 1.565e-01 5.504e-02]
[2.811e+01 3.381e+01 1.885e+02 2.501e+03 1.447e-01 3.114e-01 4.268e-01
 2.012e-01 3.040e-01 9.744e-02 2.873e+00 4.885e+00 2.198e+01 5.422e+02
 2.333e-02 1.064e-01 3.960e-01 5.279e-02 6.146e-02 2.984e-02 3.604e+01
 4.954e+01 2.512e+02 4.254e+03 2.226e-01 1.058e+00 1.252e+00 2.903e-01
 6.638e-01 2.075e-01]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0.]
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1.]


In [9]:
X_test_scaled: np.ndarray = scaler.transform(X_test) # transform test data

In [10]:
print(X_test_scaled.shape)
print(X_test.min(axis=0)) # min value per feature before scaling
print(X_test.max(axis=0)) # max value per feature before scaling
print(X_test_scaled.min(axis=0)) # min value per feature after scaling
print(X_test_scaled.max(axis=0)) # max value per feature after scaling

(143, 30)
[7.729e+00 1.072e+01 4.798e+01 1.788e+02 6.576e-02 3.398e-02 0.000e+00
 0.000e+00 1.203e-01 5.024e-02 1.144e-01 3.602e-01 7.714e-01 6.802e+00
 2.826e-03 3.746e-03 0.000e+00 0.000e+00 1.013e-02 1.217e-03 8.964e+00
 1.249e+01 5.717e+01 2.422e+02 8.409e-02 4.619e-02 0.000e+00 0.000e+00
 1.603e-01 5.865e-02]
[2.321e+01 3.928e+01 1.535e+02 1.670e+03 1.634e-01 3.454e-01 4.264e-01
 1.823e-01 2.906e-01 9.502e-02 1.370e+00 3.647e+00 1.107e+01 1.765e+02
 3.113e-02 1.354e-01 1.438e-01 4.090e-02 7.895e-02 2.193e-02 3.101e+01
 4.487e+01 2.068e+02 2.944e+03 1.902e-01 9.327e-01 1.170e+00 2.910e-01
 5.440e-01 1.446e-01]
[ 0.03540158  0.04190871  0.02895446  0.01497349  0.14260888  0.04999658
  0.          0.          0.07222222  0.00589722  0.00105015 -0.00057494
  0.00067851 -0.0007963   0.05148726  0.01434497  0.          0.
  0.04195752  0.01113138  0.03678406  0.01252665  0.03366702  0.01400904
  0.08531995  0.01833687  0.          0.          0.00749064  0.02367834]
[0.76809125 1.226970

- For the test set, after scaling, the minimum and maximum are not 0 and 1
- `MinMaxScaler` (and all the other scalers) always applies exactly the same transformation to the training and the test set
- This means the transform method always subtracts the training set minimum and divides by the training set range, which might be different from the minimum and range for the test set
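
The point above can be illustrated with a minimal NumPy sketch on a made-up one-feature dataset (the arrays and variable names below are illustrative assumptions, not taken from the cancer data). The minimum and range come from the training set only, so scaled test values can fall outside the interval from 0 to 1.

```python
import numpy as np

# Toy data (hypothetical values, chosen only for illustration)
X_train_toy = np.array([[1.0], [3.0], [5.0]])
X_test_toy = np.array([[0.0], [4.0], [6.0]])

# "fit": the statistics come from the training set only
train_min = X_train_toy.min(axis=0)
train_range = X_train_toy.max(axis=0) - train_min

# "transform": the same shift and scale are applied to both sets
X_train_toy_scaled = (X_train_toy - train_min) / train_range
X_test_toy_scaled = (X_test_toy - train_min) / train_range

print(X_train_toy_scaled.ravel())  # training values span exactly [0, 1]
print(X_test_toy_scaled.ravel())   # test values may fall below 0 or above 1
```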

## SVM

In [11]:
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)

In [12]:
svm: SVC = SVC(C=100)
svm.fit(X_train_scaled, y_train)
print("Accuracy on training set: {:.3f}".format(svm.score(X_train_scaled, y_train)))

Accuracy on training set: 1.000


In [13]:
X_test_scaled: np.ndarray = scaler.transform(X_test) # scale the test set with the training-set minimum and range
# Do not refit on the test set: score the model that was fitted on the training set in In [12]
print("Accuracy on test set: {:.3f}".format(svm.score(X_test_scaled, y_test)))


## Parameter Selection using Validation & Cross Validation 

In [14]:
from sklearn.datasets import load_iris

In [15]:
iris = load_iris() # returns a Bunch (dictionary-like object), not an ndarray
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=0)

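- `SVC()` has two parameters that are tuned here
  - `gamma` - RBF kernel coefficient (the inverse of the kernel bandwidth)
  - `C` - regularization parameter (smaller values mean stronger regularization)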
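
To make the role of `gamma` concrete, here is a minimal sketch of the RBF kernel value between two points, assuming the default `kernel='rbf'` of `SVC`. The helper `rbf_kernel_value` and the two points are made up for illustration. A larger `gamma` makes the kernel narrower, so distant points count as less similar.

```python
import numpy as np

def rbf_kernel_value(x, z, gamma):
    """RBF kernel exp(-gamma * ||x - z||^2) between two points."""
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))

x, z = [0.0, 0.0], [1.0, 1.0]  # squared distance between the points is 2
for gamma_value in [0.001, 0.1, 10]:
    # similarity decreases as gamma grows
    print(gamma_value, rbf_kernel_value(x, z, gamma_value))
```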

In [16]:
best_score: float = 0
for gamma in [0.001, 0.01, 0.1, 1, 10, 100]: # try different values for gamma
	for C in [0.001, 0.01, 0.1, 1, 10, 100]: # try different values for C
		svm: SVC = SVC(gamma=gamma, C=C) # build the model
		svm.fit(X_train, y_train) # train the model
		score: float = svm.score(X_test, y_test) # evaluate the model on the test set
		if score > best_score: # if we got a better score, store the score and parameters
			best_score = score # store the best score
			best_parameters = {'C': C, 'gamma': gamma} # store the best parameters

In [17]:
print("Best score: {:.2f}".format(best_score))
print("Best parameters: {}".format(best_parameters))

Best score: 0.97
Best parameters: {'C': 100, 'gamma': 0.001}


### Using a Validation Set

- Repeat the same procedure as before, but select the parameters on a validation set
- The training set is split into two parts: the training set proper and the validation set

In [18]:
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=0) # split the data into training and test sets
X_train_pr, X_valid, y_train_pr, y_valid = train_test_split(X_train, y_train, random_state=1) # split training set into training and validation sets

In [19]:
print("Size of training set: {} \nSize of validation set: {} \nSize of test set: {}".format(X_train_pr.shape[0], X_valid.shape[0], X_test.shape[0]))

Size of training set: 84 
Size of validation set: 28 
Size of test set: 38


In [20]:
best_score: float = 0
for gamma in [0.001, 0.01, 0.1, 1, 10, 100]: # try different values for gamma
	for C in [0.001, 0.01, 0.1, 1, 10, 100]: # try different values for C
		svm: SVC = SVC(gamma=gamma, C=C) # build the model 
		svm.fit(X_train_pr, y_train_pr) # train the model on the training set 
		score: float = svm.score(X_valid, y_valid) # evaluate the model on the validation set
		if score > best_score: # if we got a better score, store the score and parameters
			best_score = score # store the best score
			best_parameters = {'C': C, 'gamma': gamma} # store best parameters (will use **kwargs)

In [21]:
svm: SVC = SVC(**best_parameters) # build a model with best parameters (**kwargs)
svm.fit(X_train, y_train) # fit the model using the whole training set
test_score: float = svm.score(X_test, y_test) # evaluate the model on the test set

In [22]:
print("Best score on validation set: {:.2f}".format(best_score))
print("Best parameters: {}".format(best_parameters))

Best score on validation set: 0.96
Best parameters: {'C': 10, 'gamma': 0.001}


### Using Cross Validation

In [23]:
best_score = 0 # reset the best score left over from the validation-set search
for gamma in [0.001, 0.01, 0.1, 1, 10, 100]:
	for C in [0.001, 0.01, 0.1, 1, 10, 100]:
		svm: SVC = SVC(gamma=gamma, C=C)
		score: float = np.mean(cross_val_score(svm, X_train, y_train, cv=5))
		if score > best_score:
			best_score = score
			best_C = C
			best_gamma = gamma

In [24]:
svm = SVC(C=best_C, gamma=best_gamma)
svm.fit(X_train, y_train)
test_score = svm.score(X_test, y_test)

In [25]:
print("Best cross-validation score: {:.2f}".format(best_score))
print("Best parameters: C = {}, gamma = {}".format(best_C, best_gamma))
print("Test set score with best parameters: {:.2f}".format(test_score))

Best cross-validation score: 0.97
Best parameters: C = 10, gamma = 0.1
Test set score with best parameters: 0.97


- `GridSearchCV` implements grid search with cross-validation
- It will perform all the necessary model fits

In [26]:
from sklearn.model_selection import GridSearchCV

- A dictionary is required

In [27]:
param_grid: dict[str, list[float]] = {'C': [0.001, 0.01, 0.1, 1, 10, 100], 'gamma': [0.001, 0.01, 0.1, 1, 10, 100]}

- Behaves similarly to a classifier as it can call:
  - `fit`
  - `predict`
  - `score`
- However, when calling `fit`, it will run cross-validation for each combination of parameters which was specified in `param_grid`
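
As a rough sketch of how much work this implies, the parameter combinations can be enumerated with the standard library. The names `param_grid_example` and `n_folds` are illustrative assumptions that mirror the grid and `cv=5` used in these notes.

```python
from itertools import product

param_grid_example = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1, 1, 10, 100],
}
n_folds = 5

# Every combination of the listed values is one candidate model
candidates = [dict(zip(param_grid_example, values))
              for values in product(*param_grid_example.values())]

n_cv_fits = len(candidates) * n_folds
print(len(candidates))  # number of parameter combinations
print(n_cv_fits)        # model fits performed during cross-validation
```

With the default `refit=True`, one further fit on the whole training set follows, using the best combination.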

In [28]:
grid_search: GridSearchCV = GridSearchCV(SVC(), param_grid, cv=5)

- Data still needs to be split

In [29]:
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=0)

In [30]:
grid_search.fit(X_train, y_train) # required before `score`; an unfitted GridSearchCV raises NotFittedError
print("Test set score: {:.2f}".format(grid_search.score(X_test, y_test)))

In [None]:
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {:.2f}".format(grid_search.best_score_))

Best parameters: {'C': 10, 'gamma': 0.1}
Best cross-validation score: 0.97


In [None]:
print(grid_search.best_estimator_)

SVC(C=10, gamma=0.1)


# Revision Questions 

## Question 1
What is meant by data normalization in machine learning? (Remember that in this course “normalization” is understood in the wide sense and includes the transformations performed by `Normalizer`, `StandardScaler`, etc., in `scikit-learn`.)

- Transforming the dataset (rescaling its features or its samples) so that the values are on a comparable scale and compatible with the models

## Question 2
Why is normalization of features not essential for the method of Least Squares?

- Least Squares is not sensitive to the scale of the features
$$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$
  - where $h_\theta$ is the hypothesis function, $x^{(i)}$ is the $i$th training example and $y^{(i)}$ is its label
- The cost function depends on the features only through the predictions $h_\theta(x^{(i)})$
  - If a feature is multiplied by a constant $c$, dividing its coefficient by $c$ gives exactly the same predictions (a shift of a feature is absorbed by the intercept)
  - Hence normalizing the features changes the fitted coefficients but not the minimum of the cost function
- Therefore, the fitted predictions are not affected and we get the same result
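
The invariance can be checked numerically. The following is a minimal sketch on synthetic data (the data and the helper `fit_predict_ls` are made up for illustration): multiplying one feature by 1000 changes its coefficient but leaves the fitted values the same.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=20)

def fit_predict_ls(X, y):
    """Ordinary least squares with an intercept; returns fitted values and coefficients."""
    A = np.column_stack([np.ones(len(X)), X])
    theta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return A @ theta, theta

X_rescaled = X * np.array([1.0, 1000.0])  # change the units of feature 2

pred, theta = fit_predict_ls(X, y)
pred_rescaled, theta_rescaled = fit_predict_ls(X_rescaled, y)

print(np.allclose(pred, pred_rescaled))  # do the fitted values agree?
print(theta[2] / theta_rescaled[2])      # ratio of the feature-2 coefficients
```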

## Question 3
Why is normalization of features essential for Ridge Regression and the Lasso?

- As mentioned before, both Ridge Regression and the Lasso add a term to the cost function that penalizes large coefficients
- These methods are designed to prevent overfitting by reducing the magnitude of the coefficients
- If the features are not normalized, the regularization term penalizes the coefficients differently depending on the scale of the features
- For example, suppose there are 2 features $x_1$ and $x_2$
  - $x_1$ is in the range 0 to 1
  - $x_2$ is in the range 0 to 1000 (possibly because of different units)
  - To have a comparable effect on the prediction, the coefficient of $x_1$ must be about 1000 times larger than the coefficient of $x_2$, so the penalty falls much more heavily on $x_1$
  - $x_1$ would then be under-represented in the final model simply because of its units
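
The dependence on units can be seen in a minimal sketch that uses the closed-form ridge solution without an intercept on synthetic data. The data, the penalty `lam` and the helper `ridge_coefficients` are illustrative assumptions. Expressing one feature in much larger units (so that its values become small) forces it to need a large coefficient, which the penalty suppresses.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=50)

def ridge_coefficients(X, y, lam):
    """Closed-form ridge solution (no intercept): (X^T X + lam*I)^{-1} X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

lam = 10.0
scale = np.array([1.0, 0.001])  # express feature 2 in much larger units
X_rescaled = X * scale

coef = ridge_coefficients(X, y, lam)
coef_rescaled = ridge_coefficients(X_rescaled, y, lam)

pred = X @ coef
pred_rescaled = X_rescaled @ coef_rescaled

print(np.allclose(pred, pred_rescaled))  # do the two fits agree?
print(coef[1], coef_rescaled[1] * scale[1])  # effect of feature 2 per original unit
```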

## Question 4
Briefly describe the class `StandardScaler` in `scikit-learn`, paying particular attention to its fit and transform methods.

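- The `StandardScaler` ensures that, for each feature, the mean is 0 and the variance is 1 on the training set
- This brings all the features to the same scale
- Steps:
  - `fit` computes the mean and standard deviation of each feature on the training set
  - `transform` subtracts that mean from each feature and divides the result by that standard deviation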
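
In symbols, for feature $j$ with training values $x_{1j}, \dots, x_{nj}$, the two steps can be sketched as follows. This assumes the population standard deviation (division by $n$), which is the convention used in the worked example of Question 10.

$$\mu_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij}, \qquad \sigma_j = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_{ij}-\mu_j\right)^2}, \qquad z_{ij} = \frac{x_{ij}-\mu_j}{\sigma_j}$$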

## Question 5
Briefly describe the class `RobustScaler` in `scikit-learn`, paying particular attention to its fit and transform methods.

- Because `StandardScaler` uses the mean and variance, it can be distorted by outliers
- The `RobustScaler` uses the median and the interquartile range instead, which are much less affected by outliers
- Steps:
  - `fit` computes the median and interquartile range of each feature on the training set
  - `transform` subtracts that median from each feature and divides the result by that interquartile range
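
A minimal NumPy sketch of these two steps on a made-up feature with one extreme outlier (the values are hypothetical). It emulates the idea rather than calling `RobustScaler` itself, and it adds mean and standard-deviation scaling for comparison.

```python
import numpy as np

# One feature with a single extreme outlier (made-up values)
x_train = np.array([1.0, 2.0, 3.0, 4.0, 1000.0])

# "fit": median and interquartile range of the training feature
median = np.median(x_train)
q1, q3 = np.percentile(x_train, [25, 75])
iqr = q3 - q1

# "transform": subtract the median, divide by the interquartile range
x_robust = (x_train - median) / iqr

# For comparison: mean / standard-deviation scaling of the same values
x_standard = (x_train - x_train.mean()) / x_train.std()

print(x_robust)    # the four ordinary values stay well spread out
print(x_standard)  # the four ordinary values are squeezed together by the outlier
```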

## Question 6
Briefly describe the class `MinMaxScaler` in `scikit-learn`, paying particular attention to its fit and transform methods.

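- The `MinMaxScaler` shifts and rescales each feature so that its range on the training set is exactly 0 to 1
  - `fit` computes the minimum and maximum of each feature on the training set
  - `transform` subtracts that minimum and divides by the range (maximum minus minimum)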

## Question 7
Briefly describe the class `Normalizer` in `scikit-learn`, paying particular attention to its fit and transform methods

- Instead of normalizing features, samples are normalized
- Each sample is divided by its Euclidean norm
- `fit` therefore has nothing to learn from the training set, and `transform` treats every sample independently, so the training and test sets do not need to share any statistics

## Question 8
Give an example of a dataset for which the use of the class Normalizer has a better justification than the use of classes performing normalization of features (such as `StandardScaler`).

- 

## Question 9
Consider the following training set:
| Feature 1  | Feature 2  | Label  |
| -----------  | -----------  |  -----------  |
| -3 | 2 | Male |
| 0 | 5 | Female |
| 3 | 8 | Male |
| 0 | 8 | Male |

What is its normalized version, in the sense of `MinMaxScaler`? Apply the same transformation to the test set
| Feature 1 | Feature 2 |
| ----------- | ----------- |
| 1 | -1 |
| 0 | 4 |
| 2 | 5 |

1. Subtract the training-set minimum of each feature (the smallest training value becomes 0)
2. Divide by the training-set range of each feature, maximum minus minimum (the largest training value becomes 1)

**Training Set**
*Feature 1*
| Original | Shift | Shifted | Divide | Divided |
| ----------- | ----------- |  ----------- |  ----------- |  ----------- |
| -3 | +3 | 0 | ÷6 | 0 |
| 0 | +3 | 3 | ÷6 | 1/2 |
| 3 | +3 | 6 | ÷6 | 1 |
| 0 | +3 | 3 | ÷6 | 1/2 |

*Feature 2*
| Original | Shift | Shifted | Divide | Divided |
| ----------- | ----------- |  ----------- |  ----------- |  ----------- |
| 2 | -2 | 0 | ÷6 | 0 |
| 5 | -2 | 3 | ÷6 | 1/2 |
| 8 | -2 | 6 | ÷6 | 1 |
| 8 | -2 | 6 | ÷6 | 1 |

**Test Set**
- Apply Feature 1 (training set) transformation to Feature 1 (test set)
- Same for feature 2

*Feature 1*
| Original | Shift | Shifted | Divide | Divided |
| ----------- | ----------- |  ----------- |  ----------- |  ----------- |
| 1 | +3 | 4 | ÷6 | 2/3 |
| 0 | +3 | 3 | ÷6 | 1/2 |
| 2 | +3 | 5 | ÷6 | 5/6 |

*Feature 2*
| Original | Shift | Shifted | Divide | Divided |
| ----------- | ----------- |  ----------- |  ----------- |  ----------- |
| -1 | -2 | -3 | ÷6 | -1/2 |
| 4 | -2 | 2 | ÷6 | 1/3 |
| 5 | -2 | 3 | ÷6 | 1/2 |
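
The final columns of the tables above can be checked with a short NumPy sketch that emulates `fit` on the training set and `transform` on both sets (the variable names are illustrative).

```python
import numpy as np

X_train_q9 = np.array([[-3, 2], [0, 5], [3, 8], [0, 8]], dtype=float)
X_test_q9 = np.array([[1, -1], [0, 4], [2, 5]], dtype=float)

# "fit": per-feature minimum and range of the training set
train_min = X_train_q9.min(axis=0)                # [-3, 2]
train_range = X_train_q9.max(axis=0) - train_min  # [6, 6]

# "transform": the same shift and scale for both sets
X_train_q9_scaled = (X_train_q9 - train_min) / train_range
X_test_q9_scaled = (X_test_q9 - train_min) / train_range

print(X_train_q9_scaled)
print(X_test_q9_scaled)
```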

## Question 10
For the training set
| Feature 1  | Feature 2  | Label  |
| -----------  | -----------  |  -----------  |
| -10 | 0 | 1.6 |
| 10 | 2 | 2.8 |

find its normalized version in the sense of `StandardScaler`. Apply the same transformation (emulating the transform method) to the test set
| Feature 1 | Feature 2 |
| ----------- | ----------- |
| -20 | -2 |
| 10 | 4 |
| 0 | 0 |

1. Subtract the training-set mean of each feature
2. Divide by the training-set standard deviation of each feature

**Training Set**
*Feature 1*

- Mean = $\frac{(-10)+(10)}{2}=0$
- Standard Deviation = $\sqrt{\frac{((-10)-(0))^2+((10)-(0))^2}{2}}=10$

| Original | Mean | Shifted | Standard Deviation | Normalized |
| ----------- | ----------- | ----------- | ----------- | ----------- |
| -10 | -0 | -10 | ÷10 | -1 |
| 10 | -0 | 10 | ÷10 | 1 |

*Feature 2*

- Mean = $\frac{0+2}{2}=1$
- Standard Deviation = $\sqrt{\frac{((0)-(1))^2+((2)-(1))^2}{2}}=1$

| Original | Mean | Shifted | Standard Deviation | Normalized |
| ----------- | ----------- | ----------- | ----------- | ----------- |
| 0 | -1 | -1 | ÷1 | -1 |
| 2 | -1 | 1 | ÷1 | 1 |

**Test Set**
- Apply the transformation fitted on the training set (the training mean and standard deviation of each feature) to the corresponding feature of the test set

*Feature 1*
| Original | Mean | Shifted | Standard Deviation | Normalized |
| ----------- | ----------- | ----------- | ----------- | ----------- |
| -20 | -0 | -20 | ÷10 | -2 |
| 10 | -0 | 10 | ÷10 | 1 |
| 0 | -0 | 0 | ÷10 | 0 |

*Feature 2*
| Original | Mean | Shifted | Standard Deviation | Normalized |
| ----------- | ----------- | ----------- | ----------- | ----------- |
| -2 | -1 | -3 | ÷1 | -3 |
| 4 | -1 | 3 | ÷1 | 3 |
| 0 | -1 | -1 | ÷1 | -1 |
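
The worked tables can be checked with a minimal NumPy sketch of the same two steps (the variable names are illustrative). Note that `np.std` divides by $n$ by default, matching the standard deviations computed above.

```python
import numpy as np

X_train_q10 = np.array([[-10, 0], [10, 2]], dtype=float)
X_test_q10 = np.array([[-20, -2], [10, 4], [0, 0]], dtype=float)

# "fit": per-feature mean and standard deviation of the training set
mean = X_train_q10.mean(axis=0)  # [0, 1]
std = X_train_q10.std(axis=0)    # [10, 1]

# "transform": the training statistics are applied to both sets
X_train_q10_scaled = (X_train_q10 - mean) / std
X_test_q10_scaled = (X_test_q10 - mean) / std

print(X_train_q10_scaled)
print(X_test_q10_scaled)
```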

## Question 11
Consider the following training set:
| Feature 1 | Feature 2 |
| ----------- | ----------- |
| -3 | 4 |
| 4 | 3 |
| 4 | 4 |

What is its normalized version, in the sense of `Normalizer`? Apply the same transformation to the test set
| Feature 1 | Feature 2 |
| ----------- | ----------- |
| -4 | 3 |
| 3 | -3 |

- For each sample (row), compute the Euclidean norm and divide each entry of the row by it

**Training Set**
| Feature 1 | Feature 2 | Euclidean Norm | Feature 1 Transformed | Feature 2 Transformed |
| ----------- | ----------- | ----------- | ----------- | ----------- |
| -3 | 4 | $÷\sqrt{(-3)^2+(4)^2}=5$ | -3/5 | 4/5 |
| 4 | 3 | $÷\sqrt{(4)^2+(3)^2}=5$ | 4/5 | 3/5 |
| 4 | 4 | $÷\sqrt{(4)^2+(4)^2}=4\sqrt{2}$ | $\sqrt{2}/2$ | $\sqrt{2}/2$ |

**Test Set**
| Feature 1 | Feature 2 | Euclidean Norm | Feature 1 Transformed | Feature 2 Transformed |
| ----------- | ----------- | ----------- | ----------- | ----------- |
| -4 | 3 | $÷\sqrt{(-4)^2+(3)^2}=5$ | -4/5 | 3/5 |
| 3 | -3 | $÷\sqrt{(3)^2+(-3)^2}=3\sqrt{2}$ | $\sqrt{2}/2$ | $-\sqrt{2}/2$ |
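
A minimal NumPy sketch of the row-wise normalization, using the training and test sets exactly as stated in the question (the helper `normalize_rows` is an illustrative stand-in for what `Normalizer` does with its default Euclidean norm):

```python
import numpy as np

def normalize_rows(X):
    """Divide every row by its own Euclidean (L2) norm."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / norms

X_train_q11 = np.array([[-3, 4], [4, 3], [4, 4]], dtype=float)
X_test_q11 = np.array([[-4, 3], [3, -3]], dtype=float)

X_train_q11_normalized = normalize_rows(X_train_q11)
X_test_q11_normalized = normalize_rows(X_test_q11)

print(X_train_q11_normalized)
print(X_test_q11_normalized)
```

No statistics are carried over from the training set to the test set: each row only uses its own norm.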

## Question 12
What is meant by data snooping in machine learning?

- **Data Snooping** - letting the test set influence any step of developing the model (preprocessing, parameter selection or training)

## Question 13
What is wrong with the following code for data normalization?
```py
X = MinMaxScaler().fit_transform(boston.data)
X_train, X_test, y_train, y_test = train_test_split(X,boston.target)
```

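- The scaler is fitted on the whole dataset before the split, so the minimum and maximum of the test samples influence the training data, which is data snooping; split the data first, then fit the scaler on the training set only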
```py
X_train, X_test, y_train, y_test = train_test_split(boston.data, boston.target)
scaler = MinMaxScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
```

## Question 14
Discuss disadvantages of normalizing the training and test sets separately when using classes `StandardScaler`, `RobustScaler`, and `MinMaxScaler`.

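- If the test set is normalized separately, it is transformed with its own statistics (mean, median, minimum, range) rather than those of the training set
- The same raw feature value is then mapped to different scaled values in the two sets, so the model is applied to data that is inconsistent with what it was trained on and its predictions become unreliable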
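
The inconsistency can be made concrete with a minimal sketch of min-max scaling on made-up values (the arrays and the helper `min_max` are illustrative assumptions). The raw value 5.0 appears in both sets but is mapped to different scaled values when the test set is scaled separately.

```python
import numpy as np

x_train = np.array([0.0, 5.0, 10.0])
x_test = np.array([5.0, 6.0, 7.0])  # also contains the raw value 5.0

def min_max(x, reference):
    """Scale x using the minimum and range of `reference`."""
    lo, hi = reference.min(), reference.max()
    return (x - lo) / (hi - lo)

train_scaled = min_max(x_train, x_train)
test_scaled_separately = min_max(x_test, x_test)     # uses the test set's own statistics
test_scaled_consistently = min_max(x_test, x_train)  # uses the training statistics

print(train_scaled[1])              # raw 5.0 in the training set
print(test_scaled_separately[0])    # raw 5.0 in the test set, scaled separately
print(test_scaled_consistently[0])  # raw 5.0 in the test set, scaled with training statistics
```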

## Question 15
Is it admissible to normalize training and test set separately when using the class `Normalizer`? Explain briefly why or why not.

- Yes, it is admissible
- This is because instead of each feature being normalized individually, the whole sample (row) is normalized independently
- Each sample has its own unique normalization and does not depend on the normalization of another sample

## Question 16
What is wrong with the following code for data normalization?
```py
X_train, X_test, y_train, y_test = train_test_split(X, boston.target)
X_train = MinMaxScaler().fit_transform(X_train)
X_test = MinMaxScaler().fit_transform(X_test)
```

- The test set is scaled with its own minimum and maximum (a second `fit_transform`), so it is transformed differently from the training set; the scaler should be fitted on the training set only and then applied to both sets

```py
X_train, X_test, y_train, y_test = train_test_split(boston.data, boston.target)
scaler = MinMaxScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
```

## Question 17
Explain the use of a validation set for parameter selection in machine learning.

- **Validation Set** - used to estimate how well a model with given parameter values will generalize to new data
  - Helps select the parameters with the best performance and avoid overfitting
- This requires some data that has not been used for training
  - However, the test set cannot be used as this is data snooping
  - Hence, a new pseudo test set (the validation set) is split off from the training set

## Question 18
What are disadvantages of the use of the test set for parameter selection (i.e., of choosing the parameters that give the best results on the test set)?

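- The test set should not influence any choice made while building the model
- If the parameters are chosen to maximize the test score, this is data snooping
  - The reported test score is then optimistically biased and no longer estimates the performance on new data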

## Question 19
Explain the use of cross-validation for parameter selection in machine learning.

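- A single split of the data into training set and test set does not always give the same result; it depends on how the data was split (e.g. `random_state`)
- By this logic, a single split of the training set into training set proper and validation set has the same problem
- Hence cross-validation is used instead: the training set is divided into $K$ folds, each fold serves once as the validation set while the model is trained on the remaining folds, and the $K$ scores are averaged
- The parameters with the best average score are selected and the model is refitted on the whole training set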
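
The way folds take turns as the validation set can be sketched in a few lines of NumPy. This is a simplified illustration (the helper `k_fold_indices` is hypothetical, with no shuffling or stratification), not what `cross_val_score` does internally; for classifiers scikit-learn uses stratified folds.

```python
import numpy as np

def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, valid_idx) pairs; each fold is the validation set exactly once."""
    folds = np.array_split(np.arange(n_samples), n_folds)
    for k in range(n_folds):
        valid_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        yield train_idx, valid_idx

for train_idx, valid_idx in k_fold_indices(n_samples=10, n_folds=5):
    # validation indices first, then the indices used for training
    print(valid_idx, train_idx)
```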

## Question 20
How would you perform parameter selection using grid search in a hierarchical manner to improve its computational efficiency?

- Start with a small and coarse grid (as grid searches are computationally expensive)
- Ideally, the values in the interior of the first grid should be significantly better than the values at the boundary
  - Otherwise the grid should first be shifted or extended towards the better boundary values
- Then, a finer grid around the best coarse values is used to isolate the area of good performance
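
The two stages can be sketched as follows. The function `toy_score` is a hypothetical stand-in for a cross-validation score (its peak at `C=30`, `gamma=0.03` is invented for illustration), and `best_on_grid` is an illustrative helper.

```python
import numpy as np

def toy_score(C, gamma):
    """Stand-in for a cross-validation score (hypothetical); peaks at C=30, gamma=0.03."""
    return -((np.log10(C) - np.log10(30)) ** 2
             + (np.log10(gamma) - np.log10(0.03)) ** 2)

def best_on_grid(C_values, gamma_values):
    """Return the (C, gamma) pair with the highest score on the grid."""
    return max(((C, g) for C in C_values for g in gamma_values),
               key=lambda pair: toy_score(*pair))

# Stage 1: coarse logarithmic grid, one value per decade
coarse = np.logspace(-3, 3, 7)
C0, g0 = best_on_grid(coarse, coarse)

# Stage 2: finer logarithmic grid spanning one decade either side of the coarse optimum
fine_factors = np.logspace(-1, 1, 9)
C1, g1 = best_on_grid(C0 * fine_factors, g0 * fine_factors)

print(C0, g0)  # best point of the coarse grid
print(C1, g1)  # refined point from the fine grid
```

The two stages evaluate 49 + 81 points, far fewer than a single grid with the fine spacing over the whole coarse range.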

## Question 21
(The entries in the table are the accuracy of the algorithm for different values of the parameters.) How suitable was this grid for selecting the optimal values of the two parameters? Explain briefly why.

- The gaps between consecutive values of gamma and of `C` are too small, so the values are too close together
  - The grid covers too small a range
- A logarithmic scale (0.001, 0.01, 0.1, ...) would be much better

## Question 22
Answer Question 21 for the following grid for parameters A and B:

- Much better as there is greater range of values as the size of the grid has been increased by using a logarithmic scale
- This can still be improved as the values on the bottom right edge (A -> 100, B -> 100) are higher in accuracy 
- Those values should be in the middle of the grid

## Question 23
Answer Question 21 for the following grid for parameters A and B:

- This grid is good because the optimal values are in the middle of the grid 

## Question 24
List three desiderata for the method of inductive conformal prediction. Which of them are satisfied automatically?

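1. Validity: the error rate of the predictions should not exceed the chosen significance level.
2. Computational efficiency: the predictions should be cheap to compute.
3. Predictive efficiency: the prediction sets should be as small (informative) as possible.

1. and 2. are satisfied automatically (validity under the IID assumption, computational efficiency because the underlying algorithm is trained only once), while 3. is not.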

## Question 25
Briefly explain why conformal prediction is not feasible in combination with feature normalization and parameter selection.

- When preprocessing is required, it has to be redone for each test sample and for each potential label for it (to achieve guaranteed validity)
- The same is true for parameter selection

## Question 26
Compare and contrast conformal prediction and inductive conformal prediction.

**Conformal Prediction**
- More computationally intensive
- Requires the assumption of exchangeability
- More accurate predictions

**Inductive Conformal Prediction**
- Computationally efficient
- Automatically valid under the IID assumption 
- May suffer a drop in predictive efficiency

> Inductive conformal prediction relies on the assumption of IID data, while conformal prediction relies on the assumption of exchangeability. Conformal prediction is also more computationally intensive.

## Question 27
Compare and contrast conformal prediction and cross-conformal prediction.

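**Conformal Prediction**
- More computationally intensive (the underlying algorithm is retrained for each test sample and each potential label)
- Automatically valid under the assumption of exchangeability
- Uses the whole training set for every prediction, so it does not lose predictive efficiency to data splitting

**Cross-Conformal Prediction**
- Less computationally intensive (one model is trained per fold)
- Validity is not guaranteed automatically, although it usually holds in practice
- Some predictive efficiency may be lost because each model is trained on only part of the training set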

## Question 28
Compare and contrast inductive conformal prediction and cross-conformal prediction.

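**Inductive Conformal Prediction**
- Less computationally intensive (the underlying algorithm is trained once, on the training set proper)
- Automatically valid under the IID assumption
- May suffer a drop in predictive efficiency, as part of the training set is used only for calibration

**Cross-Conformal Prediction**:
- More computationally intensive (one model is trained per fold)
- Validity is not guaranteed automatically, although it usually holds in practice
- Better use of the data, as every training example is used both for training and for calibration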

## Question 29
What is an inductive conformity measure? Define the inductive conformal predictor based on a given inductive conformity measure.

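- An inductive conformity measure is a function $A$ that, given the training set proper $z_1, \dots, z_{n-m}$ and an example $z = (x, y)$, returns a score measuring how well $z$ conforms to the training set proper
- The inductive conformal predictor uses the calibration set to turn these scores into a p-value for each potential label of a test sample
---
- Let $A$ be an inductive conformity measure, $z_{n-m+1}, \dots, z_n$ the calibration set and $x^*$ a test sample. For each potential label $y$, compute the conformity scores
$$\alpha_i = A(z_1, \dots, z_{n-m}; z_i), \quad i = n-m+1, \dots, n, \qquad \alpha^y = A(z_1, \dots, z_{n-m}; (x^*, y))$$
  - and the p-value
$$p(y) = \frac{\#\{i = n-m+1, \dots, n : \alpha_i \le \alpha^y\} + 1}{m+1}$$
  - For a significance level $\epsilon$, the prediction set is $\{y : p(y) > \epsilon\}$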

## Question 30
What is an inductive nonconformity measure? Define the inductive conformal predictor based on a given inductive nonconformity measure.

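- An inductive nonconformity measure is a function $A$ that, given the training set proper $z_1, \dots, z_{n-m}$ and an example $z = (x, y)$, returns a score measuring how strange $z$ looks compared with the training set proper
- The inductive conformal predictor uses the calibration set to turn these scores into a p-value for each potential label of a test sample

---
Let $A$ be an inductive nonconformity measure, $z_{n-m+1}, \dots, z_n$ the calibration set and $x^*$ a test sample. For each potential label $y$, compute the nonconformity scores
$$\alpha_i = A(z_1, \dots, z_{n-m}; z_i), \quad i = n-m+1, \dots, n, \qquad \alpha^y = A(z_1, \dots, z_{n-m}; (x^*, y))$$
  - and the p-value
$$p(y) = \frac{\#\{i = n-m+1, \dots, n : \alpha_i \ge \alpha^y\} + 1}{m+1}$$
  - For a significance level $\epsilon$, the prediction set is $\{y : p(y) > \epsilon\}$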

## Question 31
Give three examples of inductive nonconformity measures.

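1. Regression: the absolute residual $|y - \hat{y}|$, where $\hat{y}$ is the prediction of a model fitted on the training set proper
2. Classification with nearest neighbours: the distance to the nearest training example with the same label divided by the distance to the nearest training example with a different label
3. Classification with a probabilistic model: $1 - \hat{p}(y \mid x)$, one minus the estimated probability of the label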

## Question 32
Give two examples of inductive conformity measures.

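1. The estimated probability $\hat{p}(y \mid x)$ of the label under a model fitted on the training set proper
2. The distance to the nearest training example with a different label divided by the distance to the nearest training example with the same label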

## Question 33
In the context of inductive conformal prediction, what is the minimal possible p-value for a training set proper of size $n−m$ and calibration set of size $m$?

$$\frac{1}{m+1}$$
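
A minimal sketch of where this bound comes from, assuming the usual p-value for nonconformity scores (the count of calibration scores at least as large as the test score, plus one, divided by $m+1$). The helper `icp_p_value` and the scores are made up for illustration.

```python
import numpy as np

def icp_p_value(calibration_scores, test_score):
    """p-value from nonconformity scores: (#{i: alpha_i >= alpha} + 1) / (m + 1)."""
    calibration_scores = np.asarray(calibration_scores, dtype=float)
    m = len(calibration_scores)
    return (np.sum(calibration_scores >= test_score) + 1) / (m + 1)

calibration_scores = [0.1, 0.4, 0.2, 0.7]  # hypothetical nonconformity scores, m = 4

# Stranger than every calibration example: the count is 0, giving 1 / (m + 1)
print(icp_p_value(calibration_scores, 0.9))
# No stranger than any calibration example: the count is m, giving 1
print(icp_p_value(calibration_scores, 0.0))
```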