<a name="cell-solving"></a>


1. [Singular Value Decomposition (SVD)](#cell-sovling-Axb-math-svd)
    1. [Understanding](#cell-sovling-Axb-UDVt) [$Ax$](#cell-sovling-Axb-UDVt) [with](#cell-sovling-Axb-UDVt) [$Ax=UDV^Tx$](#cell-sovling-Axb-UDVt)
    2. [Multicollinearity and Variance Inflation Factors (VIFs)](#cell-sovling-Axb-math-multicollinearity)
    3. [Principal Components Analysis (PCA)](#cell-sovling-Axb-math-pca)
    4. [Understanding PCA as](#cell-sovling-X-UDVt) [$X=UDV^T$](#cell-sovling-X-UDVt)
    5. [Principal Components Regression (PCR)](#cell-sovling-Axb-math-pcr)
    6. [[DELAYED] SVD Versus Eigendecomposition for PCA/PCR](#cell-sovling-Axb-math-pca-pcr)
    7. [[DELAYED] Scaling / Artificial Ill-Conditioning](#cell-sovling-condition-artificial)



In [2]:
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
# https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.multivariate_normal.html
import statsmodels.api as sm
# https://www.statsmodels.org/dev/datasets/index.html
from statsmodels.stats.outliers_influence import variance_inflation_factor as VIF
# https://en.wikipedia.org/wiki/Variance_inflation_factor#Calculation_and_analysis

<a name="cell-sovling-Axb-math-multicollinearity"></a>

## 3.1.1 Multicollinearity and Variance Inflation Factors (VIFs) ([Return to TOC](#cell-solving))

---

The concept of ***multicollinearity***, the degree to which the covariates correlate, figures prominently in linear model regression contexts, and can be understood in terms of ***singular values***.

For $X_{n\times p} = U_{n\times r} D_{r \times r} (V^T)_{r\times p}$

- if $r<p$ then some columns of the design matrix $X$ are perfectly ***collinear*** and $X$ would not be statistically analyzed without alteration; but,
- even if $r=p$ so $X$ is ***full rank***, if some ***singular values*** $D_{jj}$ are relatively small then the corresponding ***loadings*** $U_{\cdot j}$ and ***basis vectors*** $V_{\cdot j}$ will not contribute significantly to $X$, causing the dominant construction of $X$ to be characteristically  ***non-full rank*** with

$$X_{n\times p} = U_{n\times p} D_{p \times p} (V^T)_{p\times p} \approx U_{n\times r} D_{r \times r} (V^T)_{r\times p} \;\text{ for } r<p$$

Thus the relative magnitudes of the ***singular values*** of $X$ can characterize the degree of ***multicollinearity*** present in $X$.
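
As a rough illustration of this point, the sketch below builds a design matrix with an intercept and two covariates whose correlation `rho` is increased step by step, and tracks the ratio of the smallest to the largest singular value. The seed, the sample size, and the construction of `x2` from two independent normal draws are assumptions made only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed, for reproducibility only
n = 500
z1, z2 = rng.standard_normal((2, n))

ratios = {}
for rho in (0.09, 0.9, 0.99, 0.999):
    # x2 has population correlation rho with x1 = z1
    x2 = rho * z1 + np.sqrt(1 - rho**2) * z2
    X = np.column_stack([np.ones(n), z1, x2])
    D = np.linalg.svd(X, compute_uv=False)
    ratios[rho] = D[-1] / D[0]   # smallest relative to largest singular value
    print(f"rho={rho}: singular values {np.round(D, 3)}, min/max ratio {ratios[rho]:.4f}")
```

The smallest singular value shrinks relative to the largest as `rho` approaches 1, which is the sense in which $X$ becomes "characteristically non-full rank".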

> The effect of ***multicollinearity*** is best quantified in terms of ***Variance Inflation Factors (VIFs)*** $\frac {1}{1-R_{j}^{2}}$ where
> $R_{j}^{2}$ is the ***coefficient of determination*** for the regression of covariate $X_{j}$ on all the other covariates
> 
> $${\displaystyle X_{j}= \alpha_{0}+\alpha_{1}X_{1} + \alpha_{2}X_{2}+\cdots + \underset{\text{no } X_j}{\alpha_{j-1}X_{j-1} + \alpha_{j+1}X_{j+1}}+\cdots + \alpha_{p}X_{p}}+\epsilon$$
> 
> and is so named since it *inflates* the variation of coefficient estimates according to
> 
> $$ {\displaystyle {\widehat {\operatorname {var} }}({\hat {\beta }}_{j})=s^{2}[(X^{T}X)^{-1}]_{jj} =  {\frac {s^{2}}{(n-1){\widehat {\operatorname {var} }}(X_{j})}}\cdot \underset{VIF}{\frac {1}{1-R_{j}^{2}}}}$$
> 
> where $s^2$ is the ***residual variance*** of the original regression, and the more predictive covariates are of each other the larger $R^2_j$ is and the larger the ***VIF*** is. Thus, the greater the ***multicollinearity*** the greater the uncertainty in coefficient estimation (since attribution between correlated covariates is confounded).
>
> But notice that the increased uncertainty in the coefficients is also clearly seen by examining the ***singular values*** alone, since
>
> $${\widehat {\operatorname {var} }}({\hat {\beta }}_{j})=s^{2}[(X^{T}X)^{-1}]_{jj} = s^{2}[(VD^2V^T)^{-1}]_{jj} = s^{2}[(VD^{-2}V^T)]_{jj} = s^2 \sum_{i=1}^{p}V_{ji}^2D_i^{-2}$$
>
> shows that ${\widehat {\operatorname {var} }}({\hat {\beta }}_{j})$ is large whenever some ***singular value*** $D_i$ of $X$ is small and the corresponding weight $V_{ji}^2$ is not negligible.
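
The two expressions for $[(X^{T}X)^{-1}]_{jj}$ above can be checked against each other numerically. The following is a minimal sketch using only NumPy; the helper `vif` is a hypothetical hand-rolled function written for this example (it is not the `statsmodels` routine), and the simulated data, seed, and correlation of 0.9 are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed
n = 200
Z = rng.multivariate_normal(np.zeros(2), [[1.0, 0.9], [0.9, 1.0]], size=n)
X = np.column_stack([np.ones(n), Z])   # intercept plus two correlated covariates

def vif(X, j):
    """Hypothetical helper: VIF of column j via the auxiliary regression on the other columns."""
    others = np.delete(X, j, axis=1)
    coef, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
    resid = X[:, j] - others @ coef
    tss = ((X[:, j] - X[:, j].mean())**2).sum()
    r2 = 1 - (resid @ resid) / tss
    return 1 / (1 - r2)

# diagonal of (X^T X)^{-1} = V D^{-2} V^T, computed from the singular values
U, D, Vt = np.linalg.svd(X, full_matrices=False)
xtx_inv_diag = (Vt.T**2 / D**2).sum(axis=1)

# VIF form of the same quantity, for the non-constant covariates j = 1, 2
vif_form = np.array([vif(X, j) / ((n - 1) * X[:, j].var(ddof=1)) for j in (1, 2)])
print(xtx_inv_diag[1:], vif_form)
```

Multiplying either vector by $s^2$ gives the estimated coefficient variances, so the VIF view and the singular value view describe the same inflation.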

In [None]:
# very low multicollinearity so there's minimal variance inflation
n = 100; X = stats.multivariate_normal(cov=np.array(((1,.09),(.09,1)))).rvs(n)
X = sm.add_constant(X); Y = X.dot(np.ones(3)) + stats.norm().rvs(n)
model = sm.OLS(Y,X); results = model.fit(); results.summary2().tables[1]

In [None]:
U,D,Vt = np.linalg.svd(X, full_matrices=False)
# (X^T X)^{-1} = V D^{-2} V^T by explicit inversion (kept for comparison only)
XtXinv = np.linalg.inv((Vt.T*D**2).dot(Vt))
# diagonal of V D^{-2} V^T computed directly from the SVD
XtXinv_diag = (Vt.T**2*D**-2).sum(axis=1)
# residual standard error: results.scale is the residual variance s^2
s = np.sqrt(results.scale)
np.sqrt(XtXinv_diag)*s
# notice that these match the Std. Err. above
# confirming that these coefficient variances depend on the singular values

In [None]:
# https://www.statsmodels.org/dev/generated/statsmodels.stats.outliers_influence.variance_inflation_factor.html
VIF(X,0),VIF(X,1),VIF(X,2)

In [None]:
U,D,Vt = np.linalg.svd(X, full_matrices=False)
plt.bar(x=range(3), height=D); plt.title("Singular Values"); 

In [None]:
# This shows that the approximate reconstruction of X is very poor 
fig,ax = plt.subplots(1,2, figsize=(10,4))
ax[0].hist(np.abs(U.dot(np.diag(D)).dot(Vt) - X).flatten())
ax[0].set_title("Roundoff error in full reconstruction of X")
ax[1].hist(np.abs(U[:,:2].dot(np.diag(D[:2])).dot(Vt[:2,:])-X).flatten())
ax[1].set_title("Approximation error in compact reconstruction of X");

In [None]:
plt.plot((U[:,:2].dot(np.diag(D[:2])).dot(Vt[:2,:])).ravel(), X.ravel(),'.',
         label='Approximation error in compact reconstruction of X')
plt.plot((U.dot(np.diag(D)).dot(Vt)).ravel(), X.ravel(),'.',
         label='Roundoff error in full reconstruction of X')
plt.title("X is poorly constructed with a lower rank approximation")
plt.xlabel("Reconstructed X"); plt.ylabel("Exact X"); plt.legend();

In [32]:
# The standard errors are inflated for highly multicollinear X
n = 100; X = stats.multivariate_normal(cov=np.array(((1,.99),(.99,1)))).rvs(n)
X = sm.add_constant(X); Y = X.dot(np.ones(3)) + stats.norm().rvs(n)
model = sm.OLS(Y,X); results = model.fit(); results.summary2().tables[1]

In [32]:
U,D,Vt = np.linalg.svd(X, full_matrices=False)
# (X^T X)^{-1} = V D^{-2} V^T by explicit inversion (kept for comparison only)
XtXinv = np.linalg.inv((Vt.T*D**2).dot(Vt))
# diagonal of V D^{-2} V^T computed directly from the SVD
XtXinv_diag = (Vt.T**2*D**-2).sum(axis=1)
# residual standard error: results.scale is the residual variance s^2
s = np.sqrt(results.scale)
np.sqrt(XtXinv_diag)*s
# notice that these match the Std. Err. above
# confirming that these coefficient variances depend on the singular values

In [32]:
VIF(X,0),VIF(X,1),VIF(X,2)

In [32]:
U,D,Vt = np.linalg.svd(X, full_matrices=False)
plt.bar(x=range(3), height=D); plt.title("Singular Values"); 

In [32]:
# This shows that the approximate reconstruction of X is not as bad
# when one of the singular values is small and dropped from the SVD
fig,ax = plt.subplots(1,2, figsize=(10,4))
ax[0].hist(np.abs(U.dot(np.diag(D)).dot(Vt) - X).flatten())
ax[0].set_title("Roundoff error in full reconstruction of X")
ax[1].hist(np.abs(U[:,:2].dot(np.diag(D[:2])).dot(Vt[:2,:])-X).flatten())
ax[1].set_title("Approximation error in compact reconstruction of X");

In [33]:
plt.plot((U[:,:2].dot(np.diag(D[:2])).dot(Vt[:2,:])).ravel(), X.ravel(),'.',
         label='Approximation error in compact reconstruction of X')
plt.plot((U.dot(np.diag(D)).dot(Vt)).ravel(), X.ravel(),'.',
         label='Roundoff error in full reconstruction of X')
plt.title("X is fairly well constructed with a lower rank approximation")
plt.xlabel("Reconstructed X"); plt.ylabel("Exact X"); plt.legend();

<a name="cell-sovling-Axb-math-pca"></a>

## 3.1.2 Principal Components Analysis (PCA) ([Return to TOC](#cell-solving))

---

[***Principal Components Analysis***](https://en.wikipedia.org/wiki/Principal_component_analysis) (***PCA***) is an ***unsupervised learning*** methodology which uncovers linear structure underlying a data matrix $X_{n \times m}$. When doing ***PCA***, the "principal components are often computed by eigendecomposition of the data covariance matrix or singular value decomposition of the data matrix". So 
> for the usual ***PCA*** standardization preprocessing, where the data columns of $X_{n \times m}$ are centered and scaled by their column means $\overline{X_{\cdot j}}$ and standard deviations $\hat \sigma_{X_{\cdot j}}$ as $\tilde X_{\cdot j}  = (X_{\cdot j} - \overline{X_{\cdot j}}) / \hat \sigma_{X_{\cdot j}}$ 

a ***PCA*** is computed as either

  $$\underbrace{\tilde X \div \sqrt{n} = U_{n \times m}D_{m \times m}(V^T)_{m \times m}}_{\text{singular value decomposition of the data matrix}}\quad\text{ or }\quad
\underbrace{\hat \Sigma = (\tilde X^T\tilde X \div n) = V_{m \times m}D^2_{m \times m}V^T_{m \times m}}_{\text{eigendecomposition of the data covariance matrix}}$$

which explicitly demonstrates [the connection](https://math.stackexchange.com/questions/3869/what-is-the-intuitive-relationship-between-svd-and-pca) between ***SVD*** and the ***eigendecomposition***. Namely, for the ***SVD*** $X=UDV^T$, the ***eigendecomposition*** of the so-called ***gramian matrix***

$$X^TX = V D^2 V^T$$

has ***eigenvalues*** which are the squares of the ***singular values*** of $X$ and has the same ***(semi-)orthonormal*** matrix $V^T$ as the ***SVD*** (where $V^T$ will be ***orthonormal*** if $X$ is ***full rank*** so the ***gramian*** is ***positive definite***).
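
A quick numerical check of this relationship is sketched below on simulated data (the matrix size and seed are arbitrary assumptions). Eigenvectors and right singular vectors are only determined up to sign, so the comparison uses the absolute value of their inner products.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed
X = rng.standard_normal((50, 4))

U, D, Vt = np.linalg.svd(X, full_matrices=False)

# eigendecomposition of the gramian; eigh returns eigenvalues in ascending order
evals, evecs = np.linalg.eigh(X.T @ X)
order = np.argsort(evals)[::-1]
evals, evecs = evals[order], evecs[:, order]

# |<v_k, e_k>| equals 1 when the k-th column of V and the k-th eigenvector agree up to sign
alignment = np.abs(np.sum(Vt.T * evecs, axis=0))
print(np.allclose(evals, D**2), np.allclose(alignment, 1.0))
```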

In [33]:
# This is an example of a PCA analysis
mtcars = sm.datasets.get_rdataset("mtcars")
# standardize each column (column means and population standard deviations, ddof=0)
Xtilde = (mtcars.data - mtcars.data.mean(axis=0))/mtcars.data.std(axis=0, ddof=0)
# The actual PCA computation is the next line
U,D,Vt = np.linalg.svd(Xtilde/np.sqrt(Xtilde.shape[0]))

# Below is the code for standard PCA analyses for so called "Principal Components"
# The i^th column of U from the SVD is the i^th Principal Component
# A compact SVD approximate reconstruction of Xtilde would use 1st through i^th Principal Components
plt.figure(figsize=(18,6))
plt.subplot(131); plt.plot(D**2,'.'); plt.plot(D**2); plt.xlabel("Principal Component");
plt.ylabel("Squared Singular Values of ~X"); plt.title("Variation Explained by Each Principal Component")
plt.subplot(132); 
p=2; plt.plot(U[:,:p].dot(np.diag(D[:p])).dot(Vt[:p,:]).flatten(), 
         Xtilde.values.flatten()/Xtilde.shape[0]**.5, '.', 
         label="Rebuild ~X with 2 PCs")
p=4; plt.plot(U[:,:p].dot(np.diag(D[:p])).dot(Vt[:p,:]).flatten(), 
         Xtilde.values.flatten()/Xtilde.shape[0]**.5, '.', 
         label="Rebuild ~X with 4 PCs")
p=11; plt.plot(U[:,:p].dot(np.diag(D[:p])).dot(Vt[:p,:]).flatten(), 
         Xtilde.values.flatten()/Xtilde.shape[0]**.5, '.', 
         label="Rebuild ~X with 11 PCs")
plt.legend(); plt.xlabel("Rebuilt value"); plt.ylabel("Original value"); 
plt.title("Reconstructing ~X from its Principal Components")
plt.subplot(133); plt.plot(U[:,0],U[:,1],'.'); 
plt.xlabel("Principal Component 1"); plt.ylabel("Principal Component 2");
plt.title("Interpreting Principal Components of ~X");
for i in range(0, U.shape[0]):
  if all(np.sum(((U[:i,:2]-U[i,:2])**2).dot([[1],[15]]), axis=1) > .02):
    plt.text(U[i,0],U[i,1],mtcars.data.index[i], horizontalalignment="right")
for j in range(Vt.shape[1]):  # one arrow per original variable (columns of Vt)
  plt.plot([0,Vt[0,j]],[0,Vt[1,j]],'k')
  plt.text(Vt[0,j], Vt[1,j], mtcars.data.columns[j], color='r')
# The third plot interprets the "Principal Directions" associated with each "Principal Component"
# Consider the Xtilde space of mpg, cyl, disp, etc.
# In this space the first "Principal Component" (the first column of U from the SVD) 
# points most directly in the direction of the cyl and disp axes, and in the negative direction of the mpg axis
# The elements of the first column of U are the coordinates of the data points along this direction
# The blue points are data points, and you can see their locations relative to the Principal Directions

In [None]:
# Xtilde is standardized, so var for each col is 1
Xtilde.var(axis=0, ddof=0).sum(), (D**2).sum()
# D are the singular values of Xtilde/sqrt(n)
# D**2 are the eigenvalues of covariance matrix from its eigendecomposition, 
#      which equal the singular values of the (symmetric positive semidefinite) covariance matrix
# The sum of the eigenvalues of cov matrix is trace of the cov matrix 
# which is the sum of the variances for each column in the standardized Xtilde

***PCA*** is the ***SVD*** of $X$, which provides the principal directions in $X$; the projections of $X$ onto this coordinate system are the ***principal components*** of $X$.

![](https://miro.medium.com/max/596/1*QinDfRawRskupf4mU5bYSA.png)

<a name="cell-sovling-X-UDVt"></a>

## 3.1.3 Understanding PCA as $X=UDV^T$ ([Return to TOC](#cell-solving))

---

For ***full rank*** $X_{n \times m}$, the ***SVD*** $X=UDV^T$ produces an ***orthonormal*** $V^T$ so that the data point

$$X_{i \cdot} = U_{i \cdot} DV^T \quad \text{ or } \quad x_i = \underbrace{VD}_{A}u_i$$ 

can be represented by $U_{i \cdot}$ with respect to the ***orthogonal basis*** defined by the columns of $A=VD$. The $X_{i \cdot}$ data point is represented in the ***standard (orthonormal) basis***, but the columns $X_{\cdot j}$ are not necessarily ***orthogonal*** (and may be nearly ***linearly dependent***); however, $U_{i \cdot}$ is represented in a different ***orthogonal basis*** where the columns $U_{\cdot j}$ are ***linearly independent*** since they are ***semi-orthonormal*** ($U^TU = I_{m \times m}$).

> Since a point in $m$-dimensional space can be equivalently defined by different ***bases***, the rows of $X$ and $U$ represent the same collection of points under different ***bases***; but, the fact that $U$ has ***linearly independent*** columns means that it has no ***multicollinearity***.

- Each column $U_{\cdot j} = \left[X  V D^{-1}\right]_{\cdot j} = \sum_{k=1}^m X_{\cdot k} V_{k j}/D_{jj}$ is a linear combination of the columns of $X$, so the $U_{\cdot j}$ are new variables constructed as weighted sums of the columns of $X$. 

- Each $X_{\cdot j} = \left[U D V^T \right]_{\cdot j} = \sum_{k=1}^m D_{kk} U_{\cdot k} [V^T]_{k j} \approx \sum_{k=1}^r D_{kk} U_{\cdot k} [V^T]_{k j}$ is a linear combination of the columns of $U$, where the summation to $r<m$ shows $X_{n \times m}$ approximated as a lower rank reconstruction using the ***compact SVD*** form. 
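
How good the lower rank reconstruction is depends only on the singular values that are dropped: the Frobenius norm of the reconstruction error equals the square root of the sum of the squared discarded singular values. The sketch below checks this on simulated data; the matrix size, seed, and the choice `r = 3` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed
X = rng.standard_normal((20, 5))
U, d, Vt = np.linalg.svd(X, full_matrices=False)

r = 3                                         # number of components kept
X_r = (U[:, :r] * d[:r]) @ Vt[:r, :]          # rank-r reconstruction of X

frob_error = np.linalg.norm(X - X_r, ord="fro")
dropped = np.sqrt(np.sum(d[r:]**2))           # contribution of the discarded singular values
print(frob_error, dropped)
```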



In [None]:
i=1; X = stats.norm.rvs(size=(5,3))
U,d,Vt = np.linalg.svd(X, full_matrices=False)
print(X[i,:], "\n", (U[i,:]*d)@Vt)
# U[i,:]*d is the "broadcasted" version of U[i,:]@np.diag(d)
# https://numpy.org/doc/stable/user/basics.broadcasting.html

In [None]:
j=0
print(X @ (Vt.T)[:,[j]] / d[j])
print(U[:,[j]])

In [None]:
j=1
print(X[:,[j]])
print((U*d)@Vt[:,[j]])

<a name="cell-sovling-Axb-math-pcr"></a>

## 3.1.4 Principal Components Regression (PCR) ([Return to TOC](#cell-solving))

---

In addition to its use as an ***unsupervised learning*** methodology, ***PCA*** can be applied in the context of linear regression in order to replace an $X$ whose columns are (nearly) ***linearly dependent*** with a ***linearly independent (semi-orthogonal)*** $U$, which is either immediately available from the ***SVD*** of $X$ or is readily computed from $V$ and $D$ of the ***eigendecomposition*** of the ***gramian*** $X^TX$ as 

$$X = UDV^T \Longrightarrow \underbrace{XVD^{-1} = UDV^TVD^{-1} = U}_{\text{expressed for a single data point }D^{-1}V^Tx_i = u_i}$$

While the general motivation to replace $X$ with $U$ is to avoid the attribution uncertainty associated with ***multicollinearity***, and where possible to improve interpretability (through "factors" of linear combinations of the original columns of $X$), $U$ is also preferable to $X$ since all its ***singular values*** are $1$ (as seen from the ***SVD*** of $U$, which is $U = U II$).

- All of these benefits are realized by viewing $x_i = Au_i$ as a ***linear transformation*** so each data point $x_i$ is represented by $u_i$ in the ***orthogonal basis*** $A=VD$ rather than the ***standard (orthonormal) basis***.
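
A minimal NumPy-only sketch of the idea follows, under assumed simulated data (the seed, sample size, and correlation of 0.999 are illustrative choices). Because $U^TU = I$, the least squares coefficients on $U$ are simply $\hat\gamma = U^TY$, the fitted values agree with those from regressing on $X$, and the original coefficients are recovered as $\hat\beta = VD^{-1}\hat\gamma$.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed
n = 100
Z = rng.multivariate_normal(np.zeros(2), [[1.0, 0.999], [0.999, 1.0]], size=n)
X = np.column_stack([np.ones(n), Z])
Y = X @ np.ones(3) + rng.standard_normal(n)

U, D, Vt = np.linalg.svd(X, full_matrices=False)

beta, *_ = np.linalg.lstsq(X, Y, rcond=None)   # ordinary least squares on X
gamma = U.T @ Y                                # least squares on U, since U^T U = I
beta_from_gamma = Vt.T @ (gamma / D)           # map back: beta = V D^{-1} gamma

# standard errors: s^2 (X^T X)^{-1} versus s^2 (U^T U)^{-1} = s^2 I
resid = Y - U @ gamma
s2 = resid @ resid / (n - X.shape[1])
se_beta = np.sqrt(s2 * (Vt.T**2 / D**2).sum(axis=1))
se_gamma = np.sqrt(s2) * np.ones(X.shape[1])
print(se_beta, se_gamma)
```

The standard errors on $U$ are all equal to $s$, while those on $X$ are inflated by the small singular value. The fit itself is unchanged; the uncertainty has been re-expressed in a basis where it is no longer shared between correlated columns.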

In [None]:
n = 100; X = stats.multivariate_normal(cov=np.array(((1,.999),(.999,1)))).rvs(n)
X = sm.add_constant(X); Y = X.dot(np.ones(3)) + stats.norm().rvs(n)
model = sm.OLS(Y,X); results = model.fit(); results.summary2().tables[0] # scale = residual variance
#float(results.summary2().tables[0].iloc[-1,-1])**.5

In [None]:
results.summary2().tables[1]

In [None]:
VIF(X,0),VIF(X,1),VIF(X,2)

In [None]:
U,D,Vt = np.linalg.svd(X, full_matrices=False)
plt.bar(x=range(3), height=D); plt.title("Singular Values"); 

In [None]:
# but now we use U rather than X -- same model fit accuracy scores
model = sm.OLS(Y,U); results = model.fit(); results.summary2().tables[0] # scale = residual variance
#float(results.summary2().tables[0].iloc[-1,-1])**.5

In [None]:
# but tighter standard errors
results.summary2().tables[1]

In [None]:
# because now there's no multicollinearity, so VIFs are 1
VIF(U,0),VIF(U,1),VIF(U,2)

In [None]:
# and the SVD of U=U*I*I so singular values of U are all 1
U2,D,Vt = np.linalg.svd(U, full_matrices=False)
plt.bar(x=range(3), height=D); plt.title("Singular Values"); 

<a name="cell-sovling-Axb-math-pca-pcr"></a>

## 3.1.5 [DELAYED] SVD Versus Eigendecomposition for PCA/PCR ([Return to TOC](#cell-solving))

---

The ***PCA*** analyses above were created on the basis of an ***SVD*** computation; however, as discussed in this [Cross Validated](https://stats.stackexchange.com/questions/314046/why-does-andrew-ng-prefer-to-use-svd-and-not-eig-of-covariance-matrix-to-do-pca) discussion, there are (at least) four distinct algorithms which could be used to compute $V^T$ and $D$ (and thus compute ***PCA***):

- `np.linalg.svd(X)`
- `np.linalg.svd(X.T @ X)`
- `np.linalg.eig(X.T @ X)`
- `np.linalg.eigh(X.T @ X)`

It turns out that `np.linalg.svd(X)` is the best choice for two reasons:

1. All computations generally entail ***roundoff error***, so calculations such as $X^TX$ should be avoided for numerical accuracy when possible.
   - Even the ***singular values***/***eigenvalues*** $D_{ii}^2$ (of $X^TX$) tend to be less precise than the ***singular values*** $D_{ii}$ (of $X$), since forming the product and squaring introduce additional ***roundoff error***.
2. The ratio of the largest to smallest ***singular values*** of $X$ can be small compared to that of $X^TX$: if the largest and smallest ***singular values*** of $X$ satisfy

   $$\max_i D_{ii} > 1 > \min_j D_{jj}$$

   and, for the ***spectral radius*** of the ***gramian***,

   $${\rho(X^TX) = \max_i D_{ii}^2} \gg 1 \gg \min_j D_{jj}^2$$

   then

   $$\frac{\max_i D_{ii}}{\min_j D_{jj}} \ll \frac{\max_i D_{ii}^2}{\min_j D_{jj}^2}$$

   which is relevant for the topic of ***condition***, introduced next.

In [None]:
np.set_printoptions(precision=7)

n,m = 100,7
X = stats.norm().rvs(size=(n,m))
X = (X - X.mean(axis=0))/X.std(axis=0)
print("The total variance", X.var(axis=0).sum(), end=" ")
print("should match the sum of the eigenvalues") 
print("which is also equal to the sum of squares of singular values.\n")

# Traditional PCA: eigendecomposition of the covariance matrix
# np.linalg.eigh: "complex Hermitian (conjugate symmetric) or a real symmetric matrix."
eignvals = np.sort(np.linalg.eigh(np.cov(X.T, ddof=0))[0])[::-1]
print(eignvals, eignvals.sum())

# Traditional PCA: eigendecomposition of the covariance matrix
# np.linalg.eig: general eigendcomposition function
eignvals = np.sort(np.linalg.eig(np.cov(X.T, ddof=0))[0])[::-1]
print(eignvals, eignvals.sum())

# Traditional PCA: eigendecomposition of the covariance matrix
# SVD/eigendecomposition equivalence for symmetric positive definite matrices
eignvals = np.linalg.svd(np.cov(X.T, ddof=0))[1]
print(eignvals, eignvals.sum())

# Traditional PCA: eigendecomposition of the covariance matrix
# np.linalg.eigh with matrix-based covariance computation
eignvals = np.sort(np.linalg.eigh(X.T.dot(X)/n)[0])[::-1]
print(eignvals, eignvals.sum())

# Traditional PCA: eigendecomposition of the covariance matrix
# np.linalg.eig with matrix-based covariance computation
eignvals = np.sort(np.linalg.eig(X.T.dot(X)/n)[0])[::-1]
print(eignvals, eignvals.sum())

# Traditional PCA: eigendecomposition of the covariance matrix
# np.linalg.svd with matrix-based covariance computation
eignvals = np.linalg.svd(X.T.dot(X)/n)[1]
print(eignvals, eignvals.sum())

# PCA with SVD 
singvals = np.linalg.svd(X/n**0.5)[1]
print(singvals**2, (singvals**2).sum())

The ***singular values*** of the ***gramian*** $X^TX = V D^2 V ^T$ are the squares of those of $X= U D V ^T$, so they are much more spread out. Writing $\lambda_{max}$ and $\lambda_{min}$ for the largest and smallest ***singular values*** of $X$, it is therefore often the case that 
  
$$\kappa(X^TX) = \frac{\lambda_{max}^2}{\lambda_{min}^2} >> \frac{\lambda_{max}}{\lambda_{min}} = \kappa(X)$$

so the ***condition number*** of the ***gramian*** $X^TX$ is *much* larger than the ***condition*** of $X$ itself.

- In ***PCA***, because of the generally increased ***condition number*** of the ***covariance matrix*** $\tilde X^T \tilde X$ relative to the scaled data matrix $\tilde X$, applying ***SVD*** to $\tilde X$ is more numerically stable than computing the ***eigendecomposition*** or ***SVD*** of $\tilde X^T \tilde X$.


- Constituent components of matrix decompositions like the ***orthogonal matrix*** $U$ in the ***SVD*** $ A = U D V^T$ have smaller ***condition numbers*** than the original matrix $A$ because they factor out the ***singular values*** of $A$. For the ***orthogonal matrix*** $U$ in the ***SVD*** $ A = U D V^T$  

   $$U=U\Lambda W^T = U_{n\times m}I_{m \times m}I_{m \times m}$$
   
   so all ***singular values*** of the ***SVD*** of $U$ are $\lambda_i = \Lambda_{ii} = I_{ii} = 1$ and $\kappa(U) = 1$.

  > In the context of linear regression, the ***condition*** of the principal components of the ***SVD*** $X = UDV^T$ is $1=\kappa(U) \leq \kappa(X)$; thus, ***principal components regression*** is a mathematical formulation which specifies the use of a ***well-conditioned*** design matrix.
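
These condition number relationships can be checked with `np.linalg.cond`, which by default returns the ratio of the largest to smallest singular value. The sketch below uses an assumed simulated design matrix (seed, sample size, and correlation of 0.99 are illustrative only).

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed
n = 100
Z = rng.multivariate_normal(np.zeros(2), [[1.0, 0.99], [0.99, 1.0]], size=n)
X = np.column_stack([np.ones(n), Z])

U, D, Vt = np.linalg.svd(X, full_matrices=False)

kappa_X = np.linalg.cond(X)            # equals D.max() / D.min()
kappa_gram = np.linalg.cond(X.T @ X)   # approximately kappa_X squared
kappa_U = np.linalg.cond(U)            # all singular values of U are 1
print(kappa_X, kappa_gram, kappa_U)
```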

<a name="cell-sovling-condition-artificial"></a>

## 3.2.1 [DELAYED] Scaling / Artificial Ill-Conditioning ([Return to TOC](#cell-solving))

---

If the scales of two columns $X_{\cdot i}$ and $X_{\cdot j}$ of the matrix $X$ are very different, then creating these different scales 
on the basis of the ***SVD*** $X=UDV^T$ will require singular values $\lambda_i$ and $\lambda_j$ of very different magnitudes.  This of course means that $X$ will be ***ill-conditioned***. However, in some contexts this may only mean that $X$ is ***artificially ill-conditioned***, because if the columns can be rescaled so that their scales no longer differ substantially, then the ***artificial ill-conditioning*** would be removed.


<!-- This is because the sum of the ***singular values*** are [bounded](https://math.stackexchange.com/questions/1472420/upper-bound-for-the-sum-of-absolute-values-of-the-eigenvalues) by the sum of the elements of a matrix; thus,  -->

- The second instruction of the ever-present recommendation to "center and scale your data" in linear regression contexts is thus seen to address the ***artificial ill-conditioning*** problem. 
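
A small sketch of this effect, assuming three independent simulated columns whose scales differ by factors of a thousand (the scales and seed are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)   # arbitrary seed
n = 100
# independent columns on wildly different scales
X = rng.standard_normal((n, 3)) * np.array([1.0, 1e3, 1e-3])

# "center and scale" each column
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

cond_raw = np.linalg.cond(X)
cond_scaled = np.linalg.cond(X_scaled)
print(cond_raw, cond_scaled)
```

The columns carry no real collinearity here, so the large condition number of the raw matrix comes entirely from the choice of units and disappears after standardization.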

It turns out that ***artificial ill-conditioning*** also occurs if the rows have drastically different scales.  So again, the ***artificial ill-conditioning*** can be removed by ensuring the scales of the rows do not drastically differ. 

- Systems of nonlinear equations will be considered later, but one approach to solving them is to use a ***multivariate first-order Taylor approximation*** to transform the problem back into a system of linear equations $$b = f(x) \approx f(x^{(t)}) + \overbrace{\left[\begin{array}{ccc}\frac{\partial f_1(x^{(t)})}{\partial x_1} & \cdots & \frac{\partial f_1(x^{(t)})}{\partial x_p} \\
\vdots & \ddots & \vdots \\
\frac{\partial f_n(x^{(t)})}{\partial x_1} & \cdots & \frac{\partial f_n(x^{(t)})}{\partial x_p} 
\end{array} \right]}^{\text{the Jacobian } J_{f(x)}(x^{(t)}) \,=\, \nabla_{f(x)}(x^{(t)})^T}  (x-x^{(t)})$$

    However, $J\tilde x = \tilde b$ will be ***artificially ill-conditioned*** if the scales of the ***Jacobian*** differ greatly from, e.g., row to row. If one row of partial derivatives (i.e., one particular outcome variable of $f$) has substantially different magnitudes from the other rows, the equations can be rescaled, i.e., to $c_ib_i = c_if_i(x)$, so that there are not drastically different magnitudes from one row of the ***Jacobian*** to the next.
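
The row rescaling $c_ib_i = c_if_i(x)$ can be sketched on a small linear system. The matrix `J` below is a made-up example (a well-conditioned matrix whose rows were multiplied by very different constants), and choosing $c_i$ as the reciprocal of each row's norm is one simple assumption among several reasonable choices.

```python
import numpy as np

# hypothetical "Jacobian": a well-conditioned matrix with badly scaled rows
J0 = np.array([[2.0, 1.0, 0.0],
               [1.0, 3.0, 1.0],
               [0.0, 1.0, 4.0]])
row_scales = np.array([1.0, 1e4, 1e-4])
J = J0 * row_scales[:, None]

x_true = np.array([1.0, -2.0, 0.5])
b = J @ x_true

# rescale each equation by c_i = 1 / (norm of row i); the solution is unchanged
c = 1.0 / np.linalg.norm(J, axis=1)
J_scaled = J * c[:, None]
b_scaled = b * c

x_scaled = np.linalg.solve(J_scaled, b_scaled)
print(np.linalg.cond(J), np.linalg.cond(J_scaled), x_scaled)
```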