In [None]:
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import polars as pl
import seaborn as sns
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler



Loading Dataset

In [None]:
csv_path = 
df_total = pl.read_csv(csv_path)

In [None]:
col_exp_vars = ["Average Orientation (degrees)", "Fiber Length Density (1/um)", "Mean Fiber Length (nm)", "Mean Fiber Width (nm)"]
col_obj_vars = 

df_exp = df_total[col_exp_vars]
df_obj = df_total[col_obj_vars]

df_total.head()

In [None]:
plt.figure(figsize = (12,9))
sns.heatmap(
    df_total.to_pandas().corr(), annot=True, cmap="cividis", fmt=".2f", linewidths=0.5
)

The data is preprocessed in the following steps.
#### Step 1. Exclusion of outliers
An outlier is a data point that deviates markedly from the trend of the rest of the data.
When outliers are included in the training data, the model tries to fit them as well, which reduces generalization performance.

They are therefore removed before training.
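
As a minimal sketch of one common way to flag outliers, the example below applies the interquartile range (IQR) rule to a toy array. The IQR rule itself, the factor `1.5`, the helper name `iqr_mask`, and the data are all illustrative assumptions, not necessarily the criterion used in this notebook.

```python
import numpy as np


def iqr_mask(x: np.ndarray, k: float = 1.5) -> np.ndarray:
    """Return a boolean mask that is True for rows with no outlying value.

    Hypothetical helper: a value counts as an outlier when it lies more than
    k * IQR below the first quartile or above the third quartile of its column.
    """
    q1 = np.percentile(x, 25, axis=0)
    q3 = np.percentile(x, 75, axis=0)
    iqr = q3 - q1
    inside = (x >= q1 - k * iqr) & (x <= q3 + k * iqr)
    return inside.all(axis=1)


# Toy data (made up): the last row is far from the others in the second column.
x_toy = np.array([[1.0, 10.0], [2.0, 11.0], [3.0, 12.0], [4.0, 13.0], [5.0, 100.0]])
mask = iqr_mask(x_toy)
x_clean = x_toy[mask]  # keeps only the rows flagged as inliers
```

Whether a flagged point is a measurement error or a genuine extreme value is a judgment call, so it may be worth inspecting the dropped rows before discarding them.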

#### Step 2. Convert DataFrame to array
The DataFrame is converted to an array because column labels are not needed during training; it is sufficient to know which array column corresponds to which variable.

#### Step 3. Separate into training data and test data
The model is trained on the training data, and its generalization performance is evaluated on the held-out test data.
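
The split can be sketched with scikit-learn's `train_test_split`. The toy arrays, `test_size=0.2`, and `random_state=0` below are illustrative assumptions rather than values taken from this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (made up): 10 samples, 2 explanatory variables, 1 objective variable.
x_toy = np.arange(20, dtype=float).reshape(10, 2)
y_toy = np.arange(10, dtype=float)

# Hold out 20% of the samples for testing; fixing random_state makes the split reproducible.
x_train, x_test, y_train, y_test = train_test_split(
    x_toy, y_toy, test_size=0.2, random_state=0
)
```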

#### Step 4. Perform standardization
Standardization rescales each item $j$ so that its mean becomes 0 and its variance becomes 1.
Specifically, the following equation is used for the conversion.
$$x_{j_{std}}[i] = \frac{x_{j}[i] - \mu_{j}}{\sqrt{\sigma^2{_{j}}}}$$
where
- $j$ is the item name,
- $i$ is the index within the item,
- $N$ is the number of data points in the item,
- $x_{j}[i]$ is the $i$-th data point of item $j$,
- $\mu_{j}=\Sigma_{i=1}^{N} x_{j}[i] / N$ is the mean of item $j$,
- $\sigma^2{_{j}}=\Sigma_{i=1}^{N} \left\{ x_{j}[i] - \mu_{j} \right\}^2 / N$ is the variance of item $j$.

The advantage of standardization is that it puts all items on the same scale, so they can be compared directly.
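
The equation can be checked numerically with a minimal sketch. The toy data below is made up; the manual computation follows the definition column by column and is compared with scikit-learn's `StandardScaler`, which also divides by $N$ when computing the variance.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (made up): 4 samples, 2 items on very different scales.
x_toy = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0], [4.0, 700.0]])

# Manual standardization, applied to each column (item) separately.
mu = x_toy.mean(axis=0)
sigma2 = x_toy.var(axis=0)  # NumPy's default (ddof=0) divides by N
x_std_manual = (x_toy - mu) / np.sqrt(sigma2)

# StandardScaler uses the same population variance, so the results agree.
scaler = StandardScaler()
x_std_sklearn = scaler.fit_transform(x_toy)

print(x_std_manual.mean(axis=0))  # approximately 0 for each item
print(x_std_manual.var(axis=0))   # approximately 1 for each item
```

In a train/test workflow, the scaler is typically fitted on the training data only and then applied to the test data with `transform`, so that no information from the test set leaks into training.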
<details>
<summary> Proofs for mean value and variance after standardization </summary>

$$
\begin{aligned}
\mu_{j_{std}} &= \frac{1}{N} \Sigma_{i=1}^{N} x_{j_{std}}[i]\\
&= \frac{1}{N} \Sigma_{i=1}^{N} \frac{x_{j}[i] - \mu_{j}}{\sqrt{\sigma^2{_{j}}}}\\
&= \frac{1}{\sqrt{\sigma^2{_{j}}}} \left\{ 
    \frac{\Sigma_{i=1}^{N} x_{j}[i]}{N} - 
    \frac{\Sigma_{i=1}^{N} \mu_{j}}{N}
\right\}\\
&= \frac{1}{\sqrt{\sigma^2{_{j}}}} \left(
    \mu_{j} - \mu_{j}
\right)\\
&= 0\\
\end{aligned}
$$

$$
\begin{aligned}
{\sigma^2}_{j_{std}} &= \frac{1}{N } \Sigma_{i=1}^{N} \left\{ x_{j_{std}}[i] - \mu_{j_{std}} \right\}^2\\
&= \frac{1}{N } \Sigma_{i=1}^{N} \left\{ x_{j_{std}}[i] - 0  \right\}^2\\
&= \frac{1}{N } \Sigma_{i=1}^{N} \left\{ \frac{x_{j}[i] - \mu_{j}}{\sqrt{\sigma^2{_{j}}}} \right\}^2\\
&= \frac{1}{\sigma^2{_{j}}} \frac{1}{N } \Sigma_{i=1}^{N} \left\{ x_{j}[i] - \mu_{j} \right\}^2\\
&= \frac{1}{\sigma^2{_{j}}} \cdot \sigma^2{_{j}}\\
&= 1\\
\end{aligned}
$$
</details>


In [None]:
#Step 1: Exclusion of outliers