# Terminology

## Features
* $\phi_l$ = language
* $\phi_{tr1}$ = train set 1
* $\phi_{tr2}$ = train set 2
* $\phi_{t}$ = test set
* $\phi_{s1}$ = size of train set 1
* $\phi_{s2}$ = size of train set 2

## Factors
* $D_1$ = size of train set 1
* $D_2$ = size of train set 2
* $J_1$ = JSD between train set 1 and test set
* $J_2$ = JSD between train set 2 and test set
* $J_{12}$ = JSD between train set 1 and train set 2
* $L$ = language relatedness (TBD)

## Record
* [M] A datapoint with specific features and sp-BLEU score defined.
* [P] A cell in the sheet of sp-BLEU scores.

## Slice
* [M] A set of records that share the same values for the selected features.
* [P] A portion of the dataset whose records share the same values for the selected features.
* For example, the slice $\mathcal{S} = \langle \phi_{tr1} = \text{cc_align}, \phi_{tr2} = \text{bible} \rangle$ contains all records that have the first training on cc_align and second training on bible.

## Subslice
* [M] A subset of a slice whose records share additional selected feature(s).
* [P] A smaller portion of the extracted portion of the dataset whose records share additional selected feature(s).
* For example, the subslice $\sigma = \langle \phi_{tr1} = \text{cc_align}, \phi_{tr2} = \text{bible}, \phi_t = \text{flores} \rangle$ is a subset of $\mathcal{S}$ where the language model is tested on flores.

## Slice Group
* [M] A collection of slices that share at least one selected feature.
* [P] A bigger portion of the dataset that contains the slice entirely, where all the data share at least one selected feature.
* For example, $\Sigma = \langle \phi_{tr1} = \text{cc_align} \rangle$ is a slice group of $\mathcal{S}$ where all records have the first training on cc_align.

Note: A slice group is NOT just any superset of a slice, because it must contain all records that share the same selected features. If a record belongs to a specific slice and that slice is part of a slice group, then all the records within that slice must also be included in the slice group. This keeps the slice group consistent: it includes every record with the same selected features.
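
The three definitions above can be sketched as feature filters over a list of records. This is a minimal illustration, not the project's code: the record layout, the `select` helper, the `other_test` label, and the sp-BLEU values are all made up.

```python
# Hypothetical records; feature names mirror phi_l, phi_tr1, phi_tr2, phi_t.
# The sp-BLEU values are placeholders, not real results.
records = [
    {"lang": "ka", "tr1": "cc_align", "tr2": "bible", "test": "flores", "sp_bleu": 0.1},
    {"lang": "ka", "tr1": "cc_align", "tr2": "bible", "test": "other_test", "sp_bleu": 0.2},
    {"lang": "ka", "tr1": "cc_align", "tr2": "PMO", "test": "flores", "sp_bleu": 0.3},
    {"lang": "ka", "tr1": "bible", "tr2": "PMO", "test": "flores", "sp_bleu": 0.4},
]


def select(records, **features):
    """Return every record whose features equal all the given values."""
    return [r for r in records if all(r[k] == v for k, v in features.items())]


# Slice: fixes tr1 and tr2.
slice_s = select(records, tr1="cc_align", tr2="bible")
# Subslice: the slice plus one additional fixed feature.
subslice = select(records, tr1="cc_align", tr2="bible", test="flores")
# Slice group: fixes fewer features, so it contains the slice entirely.
slice_group = select(records, tr1="cc_align")
```

Because every selection is a filter on exact feature values, a slice group automatically contains every record of each slice inside it.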

## Fold
* A distinct subset of all the experimental records (including the fits and costs). Records in a fold need not agree on their features unless specified.
* If we specify how to partition the folds by features, then the folds essentially become slices.
* To avoid confusion, we will still refer to them as folds in analysis, but not in modelling.

## Slice_vars
* The list of variables/features that define a slice.

## x_vars
* The list of variables that are used in modelling, i.e., that appear in the trial equation.

## Trial_func
* A math equation (can be single or multivariable) that you want to test.

## Trial
* Trying a trial_func.
* Must think about trial_func, slice_vars, x_vars, and bounds when constructing a trial.

## Fit
* The coefficients/constants in the trial_func.

# Prelim Results (Updated 6/14, 10am)

Variable(s) | Trial Equation | Average RMSE
--- | --- | ---
$D_1$ | $\alpha D_1 + C$ | 0.1176
| $C \log (\alpha D_1) + \beta$ | 0.0514
| $\alpha \left(\frac{1}{D_1} + C \right) ^{p}$ | 0.0297
$D_2$ | $\alpha D_2 + C$ | 1.8239
| $C \log (\alpha D_2) + \beta$ | 0.3006
| $\alpha \left(\frac{1}{D_2} + C \right) ^{p}$ | 0.0472
$D_1, D_2$ | $ \beta_1 D_1 + \beta_2 D_2 + C $ | 1.8325
| $\alpha (D_1)^{-p_1} \cdot (D_2)^{-p_2} + C $ | 0.7218
| $\alpha_1 (D_1D_2)^{-p_1} + \alpha_2 D_2 ^ {-p_2} + C$ | 0.5680
| $\begin{cases} c_1 D_1 + c_2 D_2 + C, & D_1 > 10k \\ c_2 D_2 + C, & \text{otherwise} \end{cases}$ | 1.8506
$j_1$ | $\alpha j_1 + C$ | 9.1929
$j_2$ | $\alpha j_2 + C$ | 3.3017
$j_1, j_2$ | $\beta_1 j_1 + \beta_2 j_2 + C$ | 2.3928
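
As a minimal sketch of how one row of the table could be produced, the snippet below fits the linear trial $\alpha D_1 + C$ and computes its RMSE. It assumes ordinary least squares, which these notes do not confirm as the fitting method, and the data points are synthetic.

```python
import math


def fit_linear(xs, ys):
    """Ordinary least squares for y = alpha * x + c (closed form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    alpha = sxy / sxx
    c = mean_y - alpha * mean_x
    return alpha, c


def rmse(predict, xs, ys):
    """Root mean squared error of predict(x) against the observed ys."""
    return math.sqrt(sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs))


# Synthetic, exactly linear data (D_1 in thousands); not real sp-BLEU scores.
d1 = [1.0, 10.0, 25.0, 50.0]
scores = [2.0 * d + 1.0 for d in d1]

alpha, c = fit_linear(d1, scores)
err = rmse(lambda d: alpha * d + c, d1, scores)
```

The non-linear trials in the table would need an iterative optimizer with bounds instead of this closed form.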


---



Let $p_i^{(1)}$ = probability of word $w_i$ in dataset 1, $\mathcal{D}_1$ \\
Let $p_j^{(2)}$ = probability of word $w_j$ in dataset 2, $\mathcal{D}_2$
$$\bar{p}_k = \begin{cases}
\frac{1}{2} \left( p_k^{(1)} + p_k^{(2)} \right) & \text{if } w_k \in \mathcal{D}_1 \cap \mathcal{D}_2 \\
p_k^{(1)} & \text{if } w_k \in \mathcal{D}_1 \setminus \mathcal{D}_2 \\
p_k^{(2)} & \text{if } w_k \in \mathcal{D}_2 \setminus \mathcal{D}_1
\end{cases}$$

$$KL^{(x)} = \sum_k (\bar{p}_k - p_k^{(x)})\log \left( \frac{\bar{p}_k}{p_k^{(x)}} \right), \quad x \in \{1,2\}$$

$$ JSD = \log (0.5 (KL^{(1)} + KL^{(2)})) $$
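
For comparison, here is a minimal sketch of the textbook Jensen-Shannon divergence over word distributions. It differs from the formulas above in three ways: the mixture is $\frac{1}{2}(p_k^{(1)} + p_k^{(2)})$ for every word (a word absent from a dataset has probability 0 there), each KL term is $\sum_k p_k^{(x)} \log (p_k^{(x)} / \bar{p}_k)$, and there is no outer logarithm. Whether the reported $j$ values follow this definition or the one above is not stated in these notes; the function names are illustrative.

```python
import math
from collections import Counter


def word_probs(tokens):
    """Relative frequency of each word in a token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()}


def kl(p, q):
    """KL(p || q) = sum_k p_k * log(p_k / q_k); words with p_k = 0 add nothing."""
    return sum(pk * math.log(pk / q[w]) for w, pk in p.items() if pk > 0)


def jsd(p1, p2):
    """Textbook Jensen-Shannon divergence (natural log, so the maximum is log 2)."""
    vocab = set(p1) | set(p2)
    mix = {w: 0.5 * (p1.get(w, 0.0) + p2.get(w, 0.0)) for w in vocab}
    return 0.5 * kl(p1, mix) + 0.5 * kl(p2, mix)


same = jsd(word_probs(["a", "b"]), word_probs(["a", "b"]))
disjoint = jsd(word_probs(["a"]), word_probs(["b"]))
```

Identical distributions give 0 and fully disjoint vocabularies give $\log 2$, which makes the value easy to sanity-check.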

# $k$-fold Cross Validation
This analysis is meant to provide insight into how to combine the factors.



## Big Idea

1.   Partition the experimental records of $\langle \phi_l, \phi_{tr1}, \phi_{tr2}, \phi_t, \phi_{s1}, \phi_{s2} \rangle$ tuples into $k$ folds.
  * In Neubig's paper, the partition was random.
  * But I think we want to do it systematically first. (So a fold means a slice group here)
  * For example, for Expr 1A, the records are considered in the same fold if they have identical language, train set 1, train set 2, and test set, i.e.,
  $$\mathcal{S}^{(i)} = \langle \phi_l^{(i)}, \phi_{tr1}^{(i)}, \phi_{tr2}^{(i)}, \phi_t^{(i)} \rangle$$ Let $\mathcal{S}$ be the set of all possible tuples.


2.   Train predictor on $k-1$ folds, i.e., get the average fits (value of coefficients) of each fold.
  * For example, consider the fold $\mathcal{S}^{(i)} = \langle \text {ka, cc_align, PMO, flores} \rangle$.
  * In Expr 1A-D1, we have 4 graphs in this fold, corresponding to $D_2 = 1k, 10k, 25k, 50k$.
  * In Expr 1A-D1-Trial1, we have two fits, namely, $\alpha$ and $C$.
  * Denote $\bar{\alpha}_{\langle \text {ka, cc_align, PMO, flores} \rangle} = \mathbb{E}[\alpha_{\langle \text {ka, cc_align, PMO, flores} \rangle}]$ and $\bar{C}_{\langle \text {ka, cc_align, PMO, flores} \rangle} = \mathbb{E}[C_{\langle \text {ka, cc_align, PMO, flores} \rangle}]$, where the expectation is taken over the graphs in the fold.

3. Combine average fits from all $k-1$ folds (depending on which attempt), test on the remaining fold and evaluate.
  * For example, suppose the left out fold was $\mathcal{S}^{(k)}$ then we test the equation
  $$\hat{\text{sp-BLEU}}_{\mathcal{S}^{(k)}}(D_1) = \hat{\alpha}_{\mathcal{S}^{(k)}} D_1 + \hat{C}_{\mathcal{S}^{(k)}}$$
  on the four graphs corresponding to each $D_2$ value in $\mathcal{S}^{(k)}$. \\
  Here, the fit estimators $$\hat{\alpha}_{\mathcal{S}^{(k)}} = f(\bar{\alpha}_{\mathcal{S}^{(1)}}, ..., \bar{\alpha}_{\mathcal{S}^{(k-1)}}) \quad \text{and} \quad \hat{C}_{\mathcal{S}^{(k)}} = f(\bar{C}_{\mathcal{S}^{(1)}}, ..., \bar{C}_{\mathcal{S}^{(k-1)}})$$
  where $f$ is defined differently in each attempt.
  * Evaluate by taking the average RMSE from fitting the equation on the four graphs.
  $$\bar{\text{RMSE}}_{\mathcal{S}^{(k)}} = \frac{1}{|D_2|}\sum_{d \in D_2}\text{RMSE}_{\mathcal{S}^{(k)}, D_2 = d}$$
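
The evaluation step can be sketched as follows, assuming the linear trial $\hat{\alpha} D_1 + \hat{C}$ and a held-out fold stored as one graph per $D_2$ value. The data layout and the numbers are hypothetical.

```python
import math


def rmse(predict, xs, ys):
    """Root mean squared error of predict(x) against the observed ys."""
    return math.sqrt(sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs))


def mean_rmse_over_graphs(alpha_hat, c_hat, graphs):
    """Average the per-graph RMSE of alpha_hat * D_1 + c_hat.

    graphs maps each D_2 value to a pair (D_1 values, sp-BLEU values).
    """
    def predict(d1):
        return alpha_hat * d1 + c_hat

    errors = [rmse(predict, xs, ys) for xs, ys in graphs.values()]
    return sum(errors) / len(errors)


# Two made-up graphs for the held-out fold, keyed by D_2.
held_out = {
    1: ([1.0, 2.0], [1.0, 2.0]),   # lies exactly on y = x
    10: ([1.0, 2.0], [2.0, 3.0]),  # shifted up by 1 everywhere
}
avg_rmse = mean_rmse_over_graphs(1.0, 0.0, held_out)
```

Here the first graph contributes an RMSE of 0 and the second an RMSE of 1, so the fold average is 0.5.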

### Attempt 1: Simple Average
For each fit, simply take the average over the training folds, i.e., for every fit $\theta$
$$\hat{\theta}_{\mathcal{S}^{(k)}} = \sum_{i = 1}^{k-1} \frac{\bar{\theta}_{\mathcal{S}^{(i)}}}{k-1}$$
We should expect high average RMSE from testing the remaining fold with this fit estimator.
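
A minimal sketch of this estimator, assuming each training fold's average fits are stored in a dict keyed by fit name (the layout and values are illustrative):

```python
def simple_average(fold_fits):
    """Average each fit over the k-1 training folds.

    fold_fits: one dict of average fits per training fold,
    e.g. {"alpha": ..., "C": ...}.
    """
    names = fold_fits[0].keys()
    return {name: sum(f[name] for f in fold_fits) / len(fold_fits) for name in names}


# Made-up average fits for two training folds.
training_fits = [{"alpha": 1.0, "C": 0.0}, {"alpha": 3.0, "C": 2.0}]
estimate = simple_average(training_fits)
```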

### Attempt 2: Inclusive-Delta on $\phi$
For each feature $\phi$, only include folds that share the same value of $\phi$ as the remaining fold. \\
For example, if we attempt to use Inclusive-Delta on the language feature, $\phi_l$, then
$$\hat{\theta}_{\mathcal{S}^{(k)}} = \frac{\sum_{i=1}^{k-1} \delta(\phi_l^{(i)} =\phi_l^{(k)}) \cdot \bar{\theta}_{\mathcal{S}^{(i)}}}{\sum_{i=1}^{k-1} \delta(\phi_l^{(i)} =\phi_l^{(k)})}$$
where $\delta(\cdot)$ is 1 if the condition holds and 0 otherwise.
#### Rationale

* See which feature gives a smaller RMSE when testing on the remaining fold.
* If the average RMSE from testing the remaining fold with this estimator is LOW, that means we SHOULD include folds that share this feature with the remaining fold when training the predictor. In other words, this feature $\phi$ is important (and should get more weight?) for the predictor to be accurate.


### Attempt 3: Exclusive-Delta on $\phi$
Opposite to attempt 2, we consider only the folds that do not share the same value of feature $\phi$ as the remaining fold. \\
For example, if we attempt to use Exclusive-Delta on the language feature, $\phi_l$, then
$$\hat{\theta}_{\mathcal{S}^{(k)}} = \frac{\sum_{i=1}^{k-1} \delta(\phi_l^{(i)} \neq \phi_l^{(k)}) \cdot \bar{\theta}_{\mathcal{S}^{(i)}}}{\sum_{i=1}^{k-1} \delta(\phi_l^{(i)} \neq \phi_l^{(k)})}$$
If the average RMSE from testing the remaining fold with this estimator is LOW, that means we should NOT include folds that share this feature with the remaining fold when training the predictor. In other words, this feature $\phi$ is not important (and should get less weight?) for the predictor to be accurate.
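
Attempts 2 and 3 differ only in the indicator $\delta$, so both can be sketched with one helper. This is an illustration under assumptions: it normalizes by the number of selected folds, the fold layout is hypothetical, and `"xx"` is a placeholder language code.

```python
def delta_average(folds, held_out, feature, inclusive=True):
    """Average fits over the training folds selected by the indicator delta.

    folds: list of (features, average_fits) pairs, one per training fold.
    held_out: features of the remaining fold.
    inclusive=True keeps folds that agree with held_out on `feature`
    (Attempt 2); inclusive=False keeps the folds that disagree (Attempt 3).
    """
    selected = [
        fits
        for feats, fits in folds
        if (feats[feature] == held_out[feature]) == inclusive
    ]
    if not selected:
        # The estimator is undefined when no fold satisfies the indicator.
        raise ValueError("no training fold satisfies the indicator")
    return {
        name: sum(f[name] for f in selected) / len(selected)
        for name in selected[0]
    }


# Made-up training folds: two share the held-out language, one does not.
folds = [
    ({"lang": "ka"}, {"alpha": 1.0, "C": 0.0}),
    ({"lang": "ka"}, {"alpha": 3.0, "C": 2.0}),
    ({"lang": "xx"}, {"alpha": 10.0, "C": 5.0}),
]
held_out = {"lang": "ka"}

inclusive_fit = delta_average(folds, held_out, "lang", inclusive=True)
exclusive_fit = delta_average(folds, held_out, "lang", inclusive=False)
```

Raising an error when no fold is selected makes the undefined case explicit instead of silently dividing by zero.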

## Insight to Decision Tree
Something like
* If feature X agrees, then we consider factor Y
* Is it a waste of time to attempt delta on multiple features?
* Instead of delta, weighted?

# Intuition

* $\text{sp-BLEU}$ increases with $D_i$
* $\text{sp-BLEU}$ decreases with $j_i$

# Random ideas

* I believe $j$ depends on $D$, so when combining them, maybe we can try something like
$$ \text{sp-BLEU} (D_1, D_2, j_1, j_2) = \beta_1 \left(\frac{D_1}{j_1}\right) + \beta_2 \left(\frac{D_2}{j_2}\right) + C$$

* But it seems like $\alpha (D_1)^{-p_1} \cdot (D_2)^{-p_2}$ works for $D_i$, so let's keep that idea and maybe try
$$ \text{sp-BLEU} (D_1, D_2, j_1, j_2) = \alpha \left(\frac{D_1}{j_1}\right)^{-p_1} \cdot \left(\frac{D_2}{j_2}\right)^{-p_2} + C$$

* But wait a minute, $\text{sp-BLEU}$ has something to do with the tokenizer, so maybe it makes more sense to add the $j$s? (This is what Anthony did in [ScalingLaws](https://colab.research.google.com/drive/1Rx6sExWQ9RsNQeoHwBSmzIP2D-XvtMRy#scrollTo=ZXJpR9qlKYbe) -> Scaling Law + JSD)
$$ \text{sp-BLEU}(D_1, D_2, j_1, j_2) = \alpha (D_1)^{-p_1} \cdot (D_2)^{-p_2} + \beta_1 j_1 + \beta_2 j_2 + C_{lang+test} $$
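
As a sketch, the last candidate could be written as a plain function before handing it to a fitting routine. The function name and argument order are illustrative, and `c` stands for $C_{lang+test}$.

```python
def sp_bleu_trial(d1, d2, j1, j2, alpha, p1, p2, beta1, beta2, c):
    """Power-law term in D_1 and D_2 plus additive JSD terms and a constant."""
    return alpha * d1 ** (-p1) * d2 ** (-p2) + beta1 * j1 + beta2 * j2 + c


# Arbitrary coefficient values, only to show how the terms combine.
value = sp_bleu_trial(4.0, 1.0, 2.0, 0.0,
                      alpha=2.0, p1=0.5, p2=0.5,
                      beta1=0.25, beta2=0.0, c=1.0)
```

With these placeholder values the power-law term is 1.0, the JSD terms add 0.5, and the constant adds 1.0.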

To do any trial,
1. Create experiment
2. `.fit_all()` (only need to do once)
3. `.read_all_fits()`
4. `.plot_all()`
5. `.analyze_all()`