[![colab-logo](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/preferred-medicine/medical-ai-course-materials/blob/master/notebooks/Introduction_to_ML_libs.ipynb)

#  Basics of machine learning 

In this chapter, we introduce representative machine learning algorithms and the key points for using them, together with their mathematical background. To get a feel for how machine learning works, we walk through the algorithms for **single regression analysis** and **multiple regression analysis** along with the underlying formulas. Working through them shows how differentiation, linear algebra, and statistics are used in practice. Many parts of the neural networks introduced in the next chapter build on the ideas of multiple regression analysis.


## Single regression analysis 

First, we discuss single regression analysis (also called simple regression analysis), one of the most basic machine learning algorithms. Machine learning algorithms are roughly divided into **supervised learning** and **unsupervised learning**, and single regression analysis is a type of supervised learning. Typical supervised learning problems are **regression**, which predicts a numerical value (strictly speaking, a continuous value) such as 10 or 0.1, and **classification**, which predicts a category such as red wine or white wine. Single regression analysis, as the name implies, handles regression: it predicts one output variable from one input variable.


### Problem setting (single regression analysis) 
In machine learning, learning is performed based on data, but it is necessary for humans to decide what to use and what to predict from the information contained in the data.

Here, as an example, let us consider the problem of predicting rent. Rent is therefore the **output variable** $y$.

Next, consider what to adopt as the **input variables**. For rent prediction, candidates include the room size, the distance from the station, and the crime rate. Here we adopt the room size as the input variable $x$. In practice, when there are several candidate input variables, it is common to use a model that can handle all of them; this is introduced later under multiple regression analysis.

Most machine learning algorithms can be broken down roughly into the following three steps.

- Step 1: Determine the model
- Step 2: Determine the objective function
- Step 3: Find the optimal parameters

We will explain the above three steps in order.

### Step 1. Determine the model (single regression analysis) 

First, decide on the **model** in Step 1. A model is a **formulation** of the relationship between the output variable $y$ and the input variable $x$. How should this relationship be formulated so that rent is predicted well? At present, model design is generally done by hand rather than determined automatically by the machine (although research on determining models automatically, such as AutoML, has progressed in recent years).

For example, in a given data set, suppose that the relationship between rent and room size is as follows.


![Relationship between rent and room size](https://raw.githubusercontent.com/preferred-medicine/medical-ai-course-materials/master/notebooks/images/2/01.png)

In this case, the larger the room, the higher the rent, and it seems reasonable to use a straight line for prediction.


![Modeling with a straight line](https://raw.githubusercontent.com/preferred-medicine/medical-ai-course-materials/master/notebooks/images/2/02.png)

In this case, the straight line is adopted as a model, and the model of Step 1 is formulated as follows.

$$
y = wx + b
$$

Here $w$ is a parameter called the slope and $b$ is a parameter called the intercept. In machine learning, the slope $w$ is usually called the **weight** and the intercept $b$ the **bias**.

In single regression analysis, the model is the straight line $y = wx + b$, and the weight $w$ and bias $b$ are adjusted so that the line fits the data well.

The goal of many machine learning methods is to use a model characterized by such parameters and to find the parameters that best fit a given **data set**. Here, the data set consists of pairs of the input variable $x$ (the room size) and the teacher data $t$ (the rent). We distinguish two symbols: $y$ denotes a value predicted by the model, while $t$ denotes a value given as teacher data.

The data set can be written as $\mathcal{D} = \{x_n, t_n\}_{n=1}^{N}$. The subscript $n$ ($n=1,2,\ldots,N$) refers to the $n$-th property, and $N$ is the total number of properties. This $N$ is called the **number of samples**.

Here we introduce a technique called **data centering** to simplify the subsequent calculations. As shown in the figure below, the room size and the rent both take positive values, so the data look like the graph on the left. Centering is a transformation that shifts the data so that the **mean** becomes $\boldsymbol{0}$ and sits at the origin. Centering is commonly applied as a preprocessing step in many algorithms. Strictly speaking, the scaling described in the previous chapter, which includes centering, is often used instead.


![Centering](https://raw.githubusercontent.com/preferred-medicine/medical-ai-course-materials/master/notebooks/images/2/03.png)

One reason for this step is that, as shown in the figure below, centering the data makes the bias $b$ equal to $0$, so the model can be written without a bias term as $y_{c} = wx_{c}$. This reduces the number of parameters to adjust.


![Straight line after centering](https://raw.githubusercontent.com/preferred-medicine/medical-ai-course-materials/master/notebooks/images/2/04.png)

Centering is achieved by subtracting the mean of the input and the mean of the output from every sample. In other words,

$$
\begin{aligned}
x_{c} &= x - \bar{x} \\
t_{c} &= t - \bar{t}
\end{aligned}
$$

For example, with concrete numbers, the result is as shown in the figure below.

![Numerical comparison before and after centering](https://raw.githubusercontent.com/preferred-medicine/medical-ai-course-materials/master/notebooks/images/2/05.png)

Carrying the subscript $c$ to indicate centered values makes the notation cumbersome, so from here on we omit the subscript and assume that the data have been centered in advance. The model is then
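As a minimal sketch of the centering step, assuming NumPy (which this chapter introduces later) and the three-sample values that appear in the worked example further down in this section, the subtraction of the means could be written as follows. The variable names are illustrative only.

```python
import numpy as np

# Illustrative data: room size x and rent t for three properties
x = np.array([1.0, 2.0, 3.0])
t = np.array([2.0, 3.9, 6.1])

# Centering: subtract the mean of each variable from every sample
x_c = x - x.mean()
t_c = t - t.mean()

print(x_c)                     # centered inputs
print(t_c)                     # centered teacher data
print(x_c.mean(), t_c.mean())  # both means are numerically zero
```

After centering, both variables have mean zero, which is what allows the bias term to be dropped from the model.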

$$
y = wx
$$

and the goal of single regression analysis is to adjust the parameter $w$ appropriately based on the data set $\mathcal{D} = \{x_n, t_n\}_{n=1}^{N}$.


### Step 2. Determine the objective function (single regression analysis) 

As explained in Chapter 1, supervised learning often involves designing an objective function and learning the model by minimizing (or maximizing) the objective function.

Here the goal is to make the predicted values match the teacher data, and we express this by using the squared error between the teacher data and the predicted value as the objective function. The squared error is $0$ only when $t = y$, in which case the prediction is perfect. The squared error between the teacher data $t_{n}$ and the predicted value $y_{n}$ for the $n$-th property is

$$
(t_{n} - y_{n})^{2}
$$

Since this must be taken into account for all properties, the final objective function is the sum

$$
\begin{aligned}
\mathcal{L}&=\left( t_{1}-y_{1}\right)^{2}+\left( t_{2}-y_{2}\right)^{2}+\ldots + (t_{N}-y_{N})^{2} \\
&=\sum^{N}_{n=1}\left( t_{n}-y_{n}\right)^{2}\\
\end{aligned}
$$

Substituting the model decided in Step 1,

$$
y_{n} = wx_{n}
$$

the objective function becomes

$$
\mathcal{L}=\sum^{N}_{n=1}\left( t_{n}-wx_{n}\right)^{2}
$$

which expresses it in terms of the parameter $w$. A function of this kind is called a **loss function**.
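To make the objective function concrete, the sketch below evaluates the sum of squared errors for a few candidate values of $w$. It assumes NumPy and uses centered three-sample values for illustration; the function name `sum_squared_error` is hypothetical.

```python
import numpy as np

def sum_squared_error(w, x, t):
    """Sum of squared errors for the centered model y = w * x."""
    y = w * x                    # predictions for all samples
    return np.sum((t - y) ** 2)  # sum over n of (t_n - y_n)^2

# Illustrative centered data
x = np.array([-1.0, 0.0, 1.0])
t = np.array([-2.0, -0.1, 2.1])

for w in (1.0, 2.0, 3.0):
    print(w, sum_squared_error(w, x, t))
```

Among these three candidates the loss is smallest at $w = 2$, which suggests that the best value lies near there.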


### Step 3. Find the optimal parameter (single regression analysis) 

The last step is to find the parameter that minimizes the objective function. We have already learned that differentiation can be used to find the point at which a function is minimized. For a sum of squared differences like this one, the point where the derivative (the slope) is $0$ is the point where the loss is minimized. The derivative of the objective function is as follows.

$$
\begin{aligned}
\dfrac{\partial }{\partial w} \mathcal{L}  &= \dfrac{\partial}{\partial w} { \sum^{N}_{n=1} ( t_{n}-wx_{n})^{2} }\\
\end{aligned}
$$

Differentiation is **linear**; in particular, the derivative of a sum equals the sum of the derivatives. Using this,

$$
\dfrac{\partial}{\partial w} \mathcal{L}=\sum^{N}_{n=1}\dfrac {\partial }{\partial w}\left( t_{n}-wx_{n}\right)^{2}
$$

where the order of differentiation and the summation $\sum$ has been swapped. Next, looking at each term of the sum,

$$
\dfrac {\partial }{\partial w}\left( t_{n}-wx_{n}\right)^{2}
$$

we can see that it is a composite function of $t_n - wx_n$ and its square. Setting $u_{n} = t_{n} - wx_{n}$ and $f(u_{n}) = u_{n}^{2}$, the chain rule gives


$$
\begin{aligned}
\dfrac {\partial }{\partial w}\left( t_{n}-wx_{n}\right)^{2} &=  \dfrac {\partial }{\partial w} f(u_{n}) \\
&= \dfrac {\partial u_{n}}{\partial w} \dfrac{\partial f(u_{n})}{\partial u_{n}} \\
&=-x_{n} \times 2 u_{n}  \\
&= -2x_{n}( t_{n}-wx_{n} )
\end{aligned}
$$

From this,



$$
\begin{aligned}
\dfrac{\partial }{\partial w} \mathcal{L}
&=\sum^{N}_{n=1}\dfrac {\partial }{\partial w}\left( t_{n}-wx_{n}\right)^{2}
\\&=-\sum^{N}_{n=1}2x_{n}\left( t_{n}-wx_{n}\right)
\end{aligned}
$$

Solving for the $w$ that makes this derivative $0$,


$$
\begin{aligned}
\dfrac {\partial }{\partial w} \mathcal{L} &=0\\
-2\sum^{N}_{n=1}x_{n}\left( t_{n}-wx_{n}\right) &=0\\
-2 \sum^{N}_{n=1}x_{n}t_{n} + 2\sum^{N}_{n=1}wx^{2}_{n}&=0\\
-2\sum^{N}_{n=1}x_{n}t_{n}+2w\sum^{N}_{n=1}x^{2}_{n}&=0\\
w\sum^{N}_{n=1}x^{2}_{n}&=\sum^{N}_{n=1}x_{n}t_{n}\\
\end{aligned}
$$

Therefore,

$$
\begin{aligned}
w&=\dfrac {\displaystyle  \sum^{N}_{n=1}x_{n}t_{n}}{\displaystyle  \sum^{N}_{n=1}x^{2}_{n}}
\end{aligned}
$$

is obtained. Note that this parameter $w$ is determined solely from the given data set $\mathcal{D} = \{x_n, t_n\}_{n=1}^{N}$.


Next, let's find the parameter $w$ for the numerical example given earlier. First, to center the data, compute the means
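The derivative derived above can be sanity-checked numerically. The sketch below, which assumes NumPy and illustrative centered data, compares the analytic derivative $-2\sum_n x_n(t_n - wx_n)$ with a central finite difference and confirms that the derivative vanishes at the closed-form solution. The helper names `loss` and `grad` are hypothetical.

```python
import numpy as np

def loss(w, x, t):
    """Sum of squared errors for y = w * x."""
    return np.sum((t - w * x) ** 2)

def grad(w, x, t):
    """Analytic derivative: dL/dw = -2 * sum_n x_n (t_n - w x_n)."""
    return -2.0 * np.sum(x * (t - w * x))

# Illustrative centered data
x = np.array([-1.0, 0.0, 1.0])
t = np.array([-2.0, -0.1, 2.1])

# Compare with a central finite difference at an arbitrary point
w0, eps = 0.5, 1e-6
numeric = (loss(w0 + eps, x, t) - loss(w0 - eps, x, t)) / (2 * eps)
print(grad(w0, x, t), numeric)

# At the closed-form solution the derivative should be zero
w_star = np.sum(x * t) / np.sum(x ** 2)
print(grad(w_star, x, t))
```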

$$
\begin{aligned}
\bar{x} &= \dfrac{1}{3} (1 + 2 + 3) = 2 \\
\bar{t} &= \dfrac{1}{3}(2 + 3.9 + 6.1) = 4
\end{aligned}
$$

and subtract them from each variable as the centering preprocessing step:

$$
\begin{aligned}
x_{1} &= 1 - 2 = -1 \\
x_{2} &= 2 -2 = 0 \\
x_{3} &= 3- 2 = 1\\
t_{1} &= 2 - 4 = -2\\
t_{2} &= 3.9 - 4 = -0.1\\
t_{3} &= 6.1 - 4 = 2.1 
\end{aligned}
$$

Using the centered values, the optimal parameter $w$ is



$$
\begin{aligned}
w &= \dfrac{\displaystyle \sum_{n=1}^{N}x_{n}t_{n}}{\displaystyle  \sum_{n=1}^{N}x_{n}^{2}} \\
&= \dfrac{x_{1}t_{1} + x_{2}t_{2} + x_{3}t_{3}}{x_{1}^{2} + x_{2}^{2} + x_{3}^{2}} \\
&= \dfrac{-1 \times (-2) + 0 \times (-0.1) + 1 \times 2.1}{(-1)^{2} + 0^2 + 1^2} \\
&= 2.05
\end{aligned}
$$

This completes the learning for single regression analysis. A model that uses the parameters obtained in this way is called a **trained model**.

Next, let's use this model to make a prediction for a new sample. Computing a predicted value for new input data with a trained model is called **inference**. For example, the predicted value for a new sample $x_{q}=1.5$ is obtained as follows:

$$
\begin{aligned}
y_{c} &= wx_{c} \\
y_{q} - \bar{t} &= w(x_{q}-\bar{x}) \\
\Rightarrow y_{q} &= w(x_{q}-\bar{x}) + \bar{t} \\
&= 2.05 \times (1.5 - 2) + 4 \\
&= 2.975
\end{aligned}
$$

Because the model was trained on centered data, remember that the prediction must be converted back to the original scale by undoing the centering, as in the calculation above.
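The whole procedure, from centering through training to inference, can be sketched in a few lines. This assumes NumPy and the numbers of the worked example; the function names `fit_single_regression` and `predict` are hypothetical and chosen only for this illustration.

```python
import numpy as np

def fit_single_regression(x, t):
    """Return (w, x_mean, t_mean) for the centered model y_c = w * x_c."""
    x_mean, t_mean = x.mean(), t.mean()
    x_c, t_c = x - x_mean, t - t_mean         # centering
    w = np.sum(x_c * t_c) / np.sum(x_c ** 2)  # closed-form solution
    return w, x_mean, t_mean

def predict(x_q, w, x_mean, t_mean):
    """Predict on the original scale by undoing the centering."""
    return w * (x_q - x_mean) + t_mean

x = np.array([1.0, 2.0, 3.0])  # room sizes
t = np.array([2.0, 3.9, 6.1])  # rents (teacher data)

w, x_mean, t_mean = fit_single_regression(x, t)
y_q = predict(1.5, w, x_mean, t_mean)
print(w, y_q)
```

Returning the means together with $w$ is a deliberate choice in this sketch: inference needs them to move between the centered and original scales.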

The above is a series of steps of single regression analysis.


## Multiple Regression analysis 
Next, we turn to multiple regression analysis, which handles multiple input variables. Studying it will also deepen your understanding of linear algebra.

Multiple regression analysis, like single regression analysis, is a type of supervised learning and handles regression. The problem setting is almost the same, except that there are multiple input variables. In other words, it is a machine learning algorithm that predicts an output variable from multiple input variables.

### Problem setting (multiple regression analysis) 

As in single regression analysis, we consider the problem of predicting rent, with rent as the output variable $y$. As input variables, we now also take into account the distance from the station, the crime rate, and other factors that single regression analysis could not consider. For example, suppose there are $M$ input variables: the room size $x_{1}$, the distance from the station $x_{2}$, ..., and the crime rate $x_{M}$ (when $M=1$, this reduces to single regression analysis).

Similar to single regression analysis, learning is performed in the following three steps.

- Determine the model
- Determine the objective function
- Find the optimal parameter


### Step 1. Determine the model (multiple regression analysis) 
The model for single regression analysis is

$$
y = wx + b
$$

where $w$ was called the weight and $b$ the bias. In multiple regression analysis, this expression is extended to multiple input variables and written as a **linear combination**:

$$
y=w_{1}x_{1}+w_{2}x_{2}+\ldots +w_{M}x_{M}+b
$$

This assumes that each input variable affects the output variable linearly, which is a fairly simple model. In practice, if there are non-linear dependencies among the variables, the model must take them into account; this is explained later.

The model of multiple regression analysis can be organized using the symbol of summation,
$$
y = \sum_{m=1}^{M} w_{m} x_{m} + b
$$

Furthermore, by setting $x_0 = 1$ and $w_0 = b$,

$$
\begin{aligned}
y&=w_{1}x_{1}+w_{2}x_{2}+\ldots +w_{M}x_{M}+b\\
&=w_{1}x_{1}+w_{2}x_{2}+\ldots +w_{M}x_{M}+w_{0} x_{0}\\
&=w_{0}x_{0}+w_{1}x_{1}+\ldots +w_{M}x_{M}\\
\end{aligned}
$$

the bias $b$ can be absorbed into the sum. Rearranging this expression further,

$$
\begin{aligned}
y&=w_{0}x_{0}+w_{1}x_{1}+\ldots +w_{M}x_{M}\\
&=\begin{bmatrix}
w_{0} & w_{1} & \ldots  & w_{M}
\end{bmatrix}\begin{bmatrix}
x_{0} \\
x_{1} \\
\vdots  \\
x_{M}
\end{bmatrix}\\
&=\boldsymbol{w}^{T}\boldsymbol{x}
\end{aligned}
$$

so it can be expressed as an inner product of vectors. For later calculations, it is more convenient to have $\boldsymbol{x}$ come first, so we write


$$
\begin{aligned}
y&=w_{0}x_{0}+w_{1}x_{1}+\ldots +w_{M}x_{M}\\
&=\begin{bmatrix}
x_{0} & x_{1} & \ldots  & x_{M}
\end{bmatrix}\begin{bmatrix}
w_{0} \\
w_{1} \\
\vdots  \\
w_{M}
\end{bmatrix}\\
&=\boldsymbol{x}^{T}\boldsymbol{w}
\end{aligned}
$$

This is the model of multiple regression analysis. The parameters to be determined are the $M+1$ weights $\boldsymbol{w}$.
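As a small illustration of absorbing the bias into the weight vector, the sketch below computes the same prediction once as an explicit sum and once as the inner product $\boldsymbol{x}^{T}\boldsymbol{w}$. It assumes NumPy; the numerical values of `w` and `x` are hypothetical.

```python
import numpy as np

# Hypothetical weights: w[0] plays the role of the bias b
w = np.array([0.5, 2.0, -1.0])
# One sample with x_0 = 1 prepended to M = 2 input variables
x = np.array([1.0, 3.0, 4.0])

y_sum = w[0] * x[0] + w[1] * x[1] + w[2] * x[2]  # explicit sum
y_dot = np.dot(x, w)                             # inner product x^T w
print(y_sum, y_dot)
```

Both expressions give the same value, so the bias needs no special handling once $x_0 = 1$ is included in the input vector.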


### Step 2. Determine the objective function (multiple regression analysis) 

In single regression analysis, we reasoned that the smaller the squared error between the teacher data $t$ and the predicted value $y$, the better the prediction, and defined the sum of squared errors as the objective function. Multiple regression analysis also predicts a single value $y$, so we use the same objective function:


$$
\begin{aligned}
\mathcal{L}&=\left( t_{1}-y_{1}\right)^{2}+\left( t_{2}-y_{2}\right)^{2}+\ldots + \left( t_{N}-y_{N}\right)^{2}
\end{aligned}
$$

As in single regression analysis, the **sum of squared errors** is adopted as the objective function. In single regression analysis, this was written with summation notation as

$$
\mathcal{L}=\sum^{N}_{n=1} ( t_{n} - y_{n})^{2}
$$

Here, we rewrite it as


$$
\begin{aligned}
\mathcal{L}&=\left( t_{1}-y_{1}\right)^{2}+\left( t_{2}-y_{2}\right)^{2}+\ldots + \left( t_{N}-y_{N}\right)^{2}\\
&=\begin{bmatrix} t_{1} - y_{1} & t_{2}-y_{2} & \ldots & t_{N}-y_{N} \end{bmatrix} \begin{bmatrix}
t_{1}-y_{1} \\
t_{2}-y_{2} \\
\vdots \\
t_{N}-y_{N}
\end{bmatrix}\\
&=\left( \boldsymbol{t}-\boldsymbol{y}\right)^{T}\left( \boldsymbol{t}-\boldsymbol{y}\right) 
\end{aligned}
$$

so it can also be expressed with vectors. The prediction vector $\boldsymbol{y}$ can in turn be written as


$$
\begin{aligned}
\boldsymbol{y}=\begin{bmatrix}
y_{1} \\
y_{2} \\
\vdots \\
y_{N}
\end{bmatrix}=\begin{bmatrix}
\boldsymbol{x}_{1}^{T}\boldsymbol{w} \\
\boldsymbol{x}_{2}^{T}\boldsymbol{w} \\
\vdots  \\
\boldsymbol{x}_{N}^{T}\boldsymbol{w}
\end{bmatrix}
=\begin{bmatrix}
\boldsymbol{x}_{1}^{T} \\
\boldsymbol{x}_{2}^{T} \\
\vdots  \\
\boldsymbol{x}_{N}^{T}
\end{bmatrix}
\boldsymbol{w}
\end{aligned}
$$

Writing out the rows, this becomes


$$
\begin{aligned}
\boldsymbol{y}&=
\begin{bmatrix}
x_{10} & x_{11} & x_{12} & \ldots  & x_{1M} \\
x_{20} & x_{21} & x_{22} & \ldots  & x_{2M} \\
\vdots  & \vdots  & \vdots  & \ddots  & \vdots \\
x_{N0} & x_{N1} & x_{N{2}} & \ldots  & x_{NM}
\end{bmatrix}\begin{bmatrix}
w_{0} \\
w_{1} \\
w_{2} \\
\vdots  \\
w_{M}
\end{bmatrix}\\
&=\boldsymbol{X}\boldsymbol{w}
\end{aligned}
$$

Here, the row (horizontal) direction corresponds to samples, for example individual properties, and the column (vertical) direction corresponds to input variables, for example the room size or the distance from the station. To give concrete numbers, for the $n$-th property with room size $= 50\,\mathrm{m}^{2}$, distance from the station $= 600\,\mathrm{m}$, and crime rate $= 2$%, the number of input variables is $M=3$ and


$$
\boldsymbol{x}_{n}^{T} = \begin{bmatrix}
1 & 50 & 600 & 0.02
\end{bmatrix}
$$

so the data for one sample are stored along a row. Note that the leading $1$ is $x_{0}$, which is included to absorb the bias.
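The matrix form can be sketched as follows: a column of ones is prepended to the raw inputs to form $\boldsymbol{X}$, all predictions are computed at once as $\boldsymbol{X}\boldsymbol{w}$, and the loss is evaluated as $(\boldsymbol{t}-\boldsymbol{y})^{T}(\boldsymbol{t}-\boldsymbol{y})$. This assumes NumPy, and every numerical value below (inputs, rents, and parameters) is hypothetical.

```python
import numpy as np

# Hypothetical raw inputs: N = 3 properties, M = 2 variables (size, distance)
X_raw = np.array([
    [50.0, 600.0],
    [30.0, 200.0],
    [70.0, 900.0],
])
t = np.array([10.0, 8.0, 11.0])   # hypothetical rents (teacher data)
w = np.array([1.0, 0.2, -0.002])  # hypothetical parameters, w[0] is the bias

# Prepend the column of ones that corresponds to x_0 = 1
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])

y = np.dot(X, w)                 # all N predictions at once
loss_vec = np.dot(t - y, t - y)  # (t - y)^T (t - y)
loss_sum = sum((t[n] - y[n]) ** 2 for n in range(len(t)))  # element-wise sum
print(X.shape, y, loss_vec, loss_sum)
```

The vector form and the element-wise sum agree, which is the point of rewriting the objective function with vectors.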

### Step 3. Optimize the parameter (multiple regression analysis) 

Now let's find the model parameters $\boldsymbol{w}$ that minimize the objective function from Step 2.

※ **Here we derive an analytical solution for the optimal parameters by manipulating the equations. The derivation is somewhat involved, and the result is restated in the next section (2.3), so if you are not interested in the derivation, feel free to skip ahead to the next section.**

First, rewrite the objective function so that it is expressed in terms of $\boldsymbol{w}$:


$$
\begin{aligned}
\mathcal{L}&=\left( \boldsymbol{t}-\boldsymbol{y}\right)^{T}\left( \boldsymbol{t}-\boldsymbol{y}\right) \\
&=\left( \boldsymbol{t}-\boldsymbol{X}\boldsymbol{w}\right)^{T}\left( \boldsymbol{t}-\boldsymbol{X}\boldsymbol{w}\right) \\
&= \left\{ \boldsymbol{t}^{T}-(\boldsymbol{X}\boldsymbol{w})^{T}\right\}\left( \boldsymbol{t}-\boldsymbol{X}\boldsymbol{w}\right) \\
&=\left( \boldsymbol{t}^{T}-\boldsymbol{w}^{T}\boldsymbol{X}^{T}\right)\left( \boldsymbol{t}-\boldsymbol{X}\boldsymbol{w}\right)
\end{aligned}
$$

Note that we used the transpose rule $(\boldsymbol{A}\boldsymbol{B})^{T} = \boldsymbol{B}^{T}\boldsymbol{A}^{T}$. Expanding further with the distributive law gives


$$
\begin{aligned}
\mathcal{L}&=\boldsymbol{t}^{T}\boldsymbol{t}-\boldsymbol{t}^{T}\boldsymbol{X}\boldsymbol{w}-\boldsymbol{w}^{T}\boldsymbol{X}^{T}\boldsymbol{t} + \boldsymbol{w}^{T}\boldsymbol{X}^{T}\boldsymbol{X}\boldsymbol{w}\\
\end{aligned}
$$

We want to take the partial derivative of this objective function with respect to the parameters $\boldsymbol{w}$, but first the expression can be simplified a little. To begin with,

$$
(1)^T = 1
$$

that is, a scalar is unchanged by transposition. The term $\boldsymbol{t}^{T}\boldsymbol{X}\boldsymbol{w}$ that appears above is a scalar, so


$$
(\boldsymbol{t}^{T}\boldsymbol{X}\boldsymbol{w})^{T} = \boldsymbol{t}^{T}\boldsymbol{X}\boldsymbol{w}
$$

holds. Furthermore, from the transpose rule $(\boldsymbol{A}\boldsymbol{B}\boldsymbol{C})^T = \boldsymbol{C}^T\boldsymbol{B}^T\boldsymbol{A}^T$,


$$
(\boldsymbol{t}^{T}\boldsymbol{X}\boldsymbol{w})^T = \boldsymbol{w}^{T} \boldsymbol{X}^{T} \boldsymbol{t}
$$

also holds. Combining these,

$$
(\boldsymbol{t}^{T}\boldsymbol{X}\boldsymbol{w})^{T} = \boldsymbol{t}^{T}\boldsymbol{X}\boldsymbol{w} = \boldsymbol{w}^{T} \boldsymbol{X}^{T} \boldsymbol{t}
$$

follows. Applying this to the objective function $\mathcal{L}$ gives

$$
\begin{aligned}
\mathcal{L}=\boldsymbol{t}^{T}\boldsymbol{t}-2\boldsymbol{t}^{T}\boldsymbol{X}\boldsymbol{w} + \boldsymbol{w}^{T}\boldsymbol{X}^{T}\boldsymbol{X}\boldsymbol{w}\\
\end{aligned}
$$
To prepare for the partial differentiation with respect to $\boldsymbol{w}$, group together the constant factors that do not depend on $\boldsymbol{w}$:

$$
\begin{aligned}
\mathcal{L}&=\boldsymbol{t}^{T}\boldsymbol{t}-2\boldsymbol{t}^{T}\boldsymbol{X}\boldsymbol{w}+\boldsymbol{w}^{T}\boldsymbol{X}^{T}\boldsymbol{X}\boldsymbol{w}\\
&=\boldsymbol{t}^{T}\boldsymbol{t}-2\left( \boldsymbol{X}^{T}\boldsymbol{t}\right)^{T} \boldsymbol{w}+\boldsymbol{w}^{T}\boldsymbol{X}^{T}\boldsymbol{X}\boldsymbol{w} \\
&= \gamma + \boldsymbol{\beta}^{T}\boldsymbol{w} + \boldsymbol{w}^{T}\boldsymbol{A}\boldsymbol{w} 
\end{aligned}
$$


As learned in linear algebra, the objective function has now been written as a quadratic form (a quadratic function) in $\boldsymbol{w}$. Here $\boldsymbol{A}= \boldsymbol{X}^{T}\boldsymbol{X}$, $\boldsymbol{\beta} = -2 \boldsymbol{X}^{T}\boldsymbol{t}$, and $\gamma = \boldsymbol{t}^{T}\boldsymbol{t}$. The term with $\boldsymbol{\beta}$ is written in transposed form, $\boldsymbol{\beta}^{T}\boldsymbol{w}$, so that it matches the form of the vector-derivative formulas learned in linear algebra.

Next, consider how to find the parameters $\boldsymbol{w}$ that minimize the objective function. As noted above, the objective function is a quadratic function of $\boldsymbol{w}$. For example, take the concrete values
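The rewriting into a quadratic form can be checked numerically. The sketch below assumes NumPy and borrows the small data set used in the NumPy section of this chapter; the evaluation point `w` is an arbitrary choice. It compares $(\boldsymbol{t}-\boldsymbol{X}\boldsymbol{w})^{T}(\boldsymbol{t}-\boldsymbol{X}\boldsymbol{w})$ with $\gamma + \boldsymbol{\beta}^{T}\boldsymbol{w} + \boldsymbol{w}^{T}\boldsymbol{A}\boldsymbol{w}$.

```python
import numpy as np

X = np.array([[1, 2, 3], [1, 2, 5], [1, 3, 4], [1, 5, 9]], dtype=float)
t = np.array([1, 5, 6, 8], dtype=float)
w = np.array([0.3, -1.0, 2.0])  # arbitrary point at which to compare

A = np.dot(X.T, X)            # A = X^T X
beta = -2.0 * np.dot(X.T, t)  # beta = -2 X^T t
gamma = np.dot(t, t)          # gamma = t^T t

residual = t - np.dot(X, w)
direct = np.dot(residual, residual)  # (t - Xw)^T (t - Xw)
quadratic = gamma + np.dot(beta, w) + np.dot(w, np.dot(A, w))
print(direct, quadratic)
```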


$$
\begin{aligned}
\boldsymbol{w} = \begin{bmatrix}
w_{1} \\ w_{2}
\end{bmatrix}, 
\boldsymbol{A}=\begin{bmatrix}
1 & 2 \\
3 & 4
\end{bmatrix},\boldsymbol{\beta}=\begin{bmatrix}
1 \\
2
\end{bmatrix}, \gamma = 1 
\end{aligned} 
$$ 

Then

$$
\begin{aligned} 
\mathcal{L} & = 
\boldsymbol{w}^{T}\boldsymbol{A}\boldsymbol{w} + \boldsymbol{\beta}^{T}\boldsymbol{w} + \gamma \\ 
&=
\begin{bmatrix}
w_{1} & w_{2}
\end{bmatrix}\begin{bmatrix}
1 & 2 \\
3 & 4
\end{bmatrix}\begin{bmatrix}
w_{1} \\
w_{2}
\end{bmatrix}
+\begin{bmatrix}
1 & 2
\end{bmatrix} \begin{bmatrix} 
w_{1} \\ 
w_{2} 
\end{bmatrix} + 1 \\ 
&=
\begin{bmatrix} 
w_{1} & w_{2} 
\end{bmatrix} 
\begin{bmatrix} 
w_{1} + 2w_{2} \\ 
3w_{1} + 4w_{2} 
\end{bmatrix} + w_{1} + 2w_{2} + 1 \\ 
&=w_{1}\left( w_{1} + 2w_{2}\right) + w_{2}\left( 3w_{1} + 4w_{2}\right) + w_{1} + 2w_{2} + 1 \\ 
&=w^{2}_{1} + 5w_{1}w_{2} + 4w^{2}_{2} + w_{1} + 2w_{2}+1 \\ 
\end{aligned}
$$

and, viewed as a function of $w_{1}$ or of $w_{2}$ separately,


$$
\begin{aligned}
\mathcal{L}
&= w^{2}_{1} + \left( 5w_{2} + 1\right) w_{1} + 
\left( 4w^{2}_{2}+2w_{2}+1\right) \\ 
&=4w^{2}_{2} + \left(5w_{1} + 2\right) w_{2} + \left( w^{2}_{1} + w_{1} + 1\right) \end{aligned} 
$$

it is a quadratic function of each variable.
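The expansion above can be confirmed with a quick numerical check. This sketch assumes NumPy, uses the same $\boldsymbol{A}$, $\boldsymbol{\beta}$, and $\gamma$ as the example, and evaluates both forms at an arbitrarily chosen point; the function names are hypothetical.

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
beta = np.array([1.0, 2.0])
gamma = 1.0

def quadratic_form(w):
    """w^T A w + beta^T w + gamma in matrix form."""
    return np.dot(w, np.dot(A, w)) + np.dot(beta, w) + gamma

def expanded(w):
    """The same function written out as a polynomial in w1 and w2."""
    w1, w2 = w
    return w1**2 + 5*w1*w2 + 4*w2**2 + w1 + 2*w2 + 1

w = np.array([0.5, -2.0])  # arbitrary test point
print(quadratic_form(w), expanded(w))
```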

A quadratic function has a shape like the one in the figure below.

![Relationship between a parameter and the objective function (2D)](https://raw.githubusercontent.com/preferred-medicine/medical-ai-course-materials/master/notebooks/images/2/06.png)

Visualized in three dimensions, it looks like the following figure.



![Relationship between the parameters and the objective function (3D)](https://raw.githubusercontent.com/preferred-medicine/medical-ai-course-materials/master/notebooks/images/2/08.png)

At the point where the objective function (the sum of squared errors) is minimized, the slope obtained by differentiating with respect to each variable is zero.


![Point where the objective function is minimized](https://raw.githubusercontent.com/preferred-medicine/medical-ai-course-materials/master/notebooks/images/2/07.png)

This example used two parameters, $w_{1}$ and $w_{2}$, but the same reasoning applies to $w_{0}$, $w_{1}$, $w_{2}$, $\ldots$, $w_{M}$:


$$
\begin{cases}
\dfrac {\partial }{\partial w_{0}}\mathcal{L}=0\\
\dfrac {\partial }{\partial w_{1}}\mathcal{L}=0\\
\ \ \ \ \ \vdots \\
\dfrac {\partial }{\partial w_{M}}\mathcal{L}=0\\
\end{cases}
$$

Collecting these into a vector,

$$
\begin{aligned}
\begin{bmatrix}
\dfrac {\partial}{\partial w_{0}} \mathcal{L} \\
\dfrac {\partial}{\partial w_{1}} \mathcal{L} \\
\vdots  \\
\dfrac {\partial}{\partial w_{M}} \mathcal{L} \\
\end{bmatrix}&=\begin{bmatrix}
0 \\
0 \\
\vdots  \\
0 \\
\end{bmatrix} \\
\Rightarrow \dfrac {\partial}{\partial \boldsymbol{w}} \mathcal{L} &= \boldsymbol{0} \\
\end{aligned}
$$

which can be written as a derivative with respect to a vector. We now determine the $\boldsymbol{w}$ that satisfies this equation by substituting and rearranging so that $\boldsymbol{w}$ can be solved for. (The following calculation uses material from linear algebra, including differentiation with respect to a vector, so if a step is unclear, review the linear algebra material as you go.)

$$
\begin{aligned}
\dfrac {\partial }{\partial \boldsymbol{w}}\mathcal{L} =\dfrac {\partial }{\partial \boldsymbol{w}}\left( \gamma + \boldsymbol{\beta}^{T}\boldsymbol{w} + \boldsymbol{w}^{T}\boldsymbol{A}\boldsymbol{w}\right)
= \boldsymbol{0}\\
\dfrac {\partial }{\partial \boldsymbol{w}}\left( \gamma\right) +\dfrac {\partial }{\partial \boldsymbol{w}}\left( \boldsymbol{\beta}^{T}\boldsymbol{w}\right) +\dfrac {\partial }{\partial \boldsymbol{w}}\left( \boldsymbol{w}^{T}\boldsymbol{A}\boldsymbol{w}\right) 
=\boldsymbol{0}\\
\boldsymbol{0}+\boldsymbol{\beta}+\left( \boldsymbol{A}+\boldsymbol{A}^{T}\right) \boldsymbol{w} =\boldsymbol{0}\\
-2\boldsymbol{X}^{T}\boldsymbol{t}+\left\{ \boldsymbol{X}^{T}\boldsymbol{X} + \left( \boldsymbol{X}^{T}\boldsymbol{X}\right)^{T}\right\} \boldsymbol{w}
=\boldsymbol{0}\\
-2\boldsymbol{X}^{T}\boldsymbol{t}+2\boldsymbol{X}^{T}\boldsymbol{X}\boldsymbol{w}=\boldsymbol{0}\\
\boldsymbol{X}^{T}\boldsymbol{X}\boldsymbol{w}=\boldsymbol{X}^{T}\boldsymbol{t}\\
\end{aligned}
$$

Here, assuming that $\boldsymbol{X}^{T} \boldsymbol{X}$ has an inverse, multiply both sides from the left by $\left( \boldsymbol{X}^{T}\boldsymbol{X}\right)^{-1}$:


$$
\begin{aligned}
\left( \boldsymbol{X}^{T}\boldsymbol{X}\right)^{-1}\boldsymbol{X}^{T}\boldsymbol{X} \boldsymbol{w} =\left( \boldsymbol{X}^{T}\boldsymbol{X}\right)^{-1}\boldsymbol{X}^{T}\boldsymbol{t} \\
\boldsymbol{I}\boldsymbol{w}=\left( \boldsymbol{X}^{T}\boldsymbol{X}\right)^{-1}\boldsymbol{X}^{T}\boldsymbol{t} \\
\boldsymbol{w}=\left( \boldsymbol{X}^{T}\boldsymbol{X}\right)^{-1}\boldsymbol{X}^{T}\boldsymbol{t}
\end{aligned}
$$


Thus the optimal parameters $\boldsymbol{w}$ are obtained from the given data set $\boldsymbol{X}, \boldsymbol{t}$. Here $\boldsymbol{I}$ is the identity matrix. When rearranging the equation, be careful not to write it as a fraction such as

$$
\boldsymbol{w} = \dfrac{\boldsymbol{X}^{T}\boldsymbol{t}}{\boldsymbol{X}^{T}\boldsymbol{X}}
$$

There is no division in matrix calculations, so the solution must be computed by matrix multiplication with the inverse matrix.

Another common mistake is to rearrange the equation for $\boldsymbol{w}$ as follows:
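As a check on the derivation, the sketch below solves $\boldsymbol{X}^{T}\boldsymbol{X}\boldsymbol{w}=\boldsymbol{X}^{T}\boldsymbol{t}$ and verifies that the gradient $-2\boldsymbol{X}^{T}\boldsymbol{t}+2\boldsymbol{X}^{T}\boldsymbol{X}\boldsymbol{w}$ vanishes at the solution. It assumes NumPy and borrows the small data set from the NumPy section of this chapter. Note that it uses `np.linalg.solve` rather than an explicit inverse; that is a choice made for this sketch, not something the derivation prescribes.

```python
import numpy as np

X = np.array([[1, 2, 3], [1, 2, 5], [1, 3, 4], [1, 5, 9]], dtype=float)
t = np.array([1, 5, 6, 8], dtype=float)

A = np.dot(X.T, X)  # X^T X, a 3x3 matrix
b = np.dot(X.T, t)  # X^T t, a length-3 vector

# Solve (X^T X) w = X^T t without forming the inverse explicitly
w = np.linalg.solve(A, b)

# Gradient of the objective at w: -2 X^T t + 2 X^T X w
gradient = -2.0 * b + 2.0 * np.dot(A, w)
print(w)
print(gradient)  # numerically close to the zero vector
```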


$$
\begin{aligned}
\boldsymbol{X}^{T}\boldsymbol{X}\boldsymbol{w}&=\boldsymbol{X}^{T}\boldsymbol{t}\\
\left( \boldsymbol{X}^{T}\right)^{-1}\boldsymbol{X}^{T}\boldsymbol{X}\boldsymbol{w}&=\left( \boldsymbol{X}^{T}\right)^{-1}\boldsymbol{X}^{T}\boldsymbol{t}\\
\boldsymbol{X}\boldsymbol{w}&=\boldsymbol{t}\\
\boldsymbol{X}^{-1}\boldsymbol{X}\boldsymbol{w}&=\boldsymbol{X}^{-1}\boldsymbol{t}\\
\boldsymbol{w}&=\boldsymbol{X}^{-1}\boldsymbol{t}
\end{aligned}
$$

However, this does not hold in general, because only a square matrix can have an inverse. In general, the number of samples $N$ and the number of input variables $M+1$ are not equal, so $\boldsymbol{X} \in \mathcal{R}^{N \times (M+1)}$ is not square and has no inverse. In contrast, $\boldsymbol{X}^{T}\boldsymbol{X} \in \mathcal{R}^{(M+1) \times (M+1)}$ is always a square matrix, regardless of the number of samples $N$. (Being square is necessary but not sufficient for an inverse to exist; the stricter conditions are not covered here.)

At inference time, the parameters $\boldsymbol{w}$ obtained by learning are used to compute


$$
y_{q} = \boldsymbol{w}^{T}\boldsymbol{x}_{q}
$$

which gives the predicted value.
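The shape argument can be observed directly. In this sketch, which assumes NumPy and the $4 \times 3$ data matrix from the NumPy section of this chapter, asking for the inverse of the non-square $\boldsymbol{X}$ raises an error, while $\boldsymbol{X}^{T}\boldsymbol{X}$ is square.

```python
import numpy as np

X = np.array([[1, 2, 3], [1, 2, 5], [1, 3, 4], [1, 5, 9]], dtype=float)
print(X.shape)    # N x (M+1): not square

XtX = np.dot(X.T, X)
print(XtX.shape)  # (M+1) x (M+1): always square

inverse_failed = False
try:
    np.linalg.inv(X)  # a non-square matrix cannot be inverted
except np.linalg.LinAlgError as err:
    inverse_failed = True
    print("cannot invert X:", err)
```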

## Implementation with NumPy

Let's use Python to implement the linear algebra above, using multiple regression analysis as an example. Python has a widely used library called **NumPy** that makes linear algebra easy to handle. NumPy is also used frequently with Chainer, which is introduced in the next chapter, so learning NumPy is an important first step toward learning deep learning.

This section assumes that you know basic Python syntax: variables (numbers, strings, lists, tuples, dictionaries), control flow (for, if), functions, and classes.

In multiple regression analysis, we found that the optimal parameters $\boldsymbol{w}$ are given by

$$
\boldsymbol{w}=\left( \boldsymbol{X}^{T}\boldsymbol{X}\right)^{-1}\boldsymbol{X}^{T}\boldsymbol{t}
$$

To compute them, we will use the following five operations.

- Definition of vector
- Matrix definition
- Transposition
- Matrix product
- Inverse matrix

Specifically, suppose the following data set is given. In this example, the number of samples $N$ is $4$, the input data $\boldsymbol{X}$ has $2$ variables, and $\boldsymbol{t}$ is the teacher data.


$$
\boldsymbol{X} = 
\begin{bmatrix}
1 & 2 & 3 \\
1 & 2 & 5 \\
1 & 3 & 4 \\  
1 & 5 & 9 
\end{bmatrix}, \
\boldsymbol{t} = 
\begin{bmatrix}
1 \\ 5 \\ 6 \\ 8
\end{bmatrix}
$$

Here, $\boldsymbol{X}$ is written in the form in which the parameters $\boldsymbol{w}$ include the bias $b$, so the first column of $\boldsymbol{X}$ is filled with $1$.

Let's look at the implementation. First, import NumPy.


In [0]:
import numpy as np

A vector is defined as follows.

In [0]:
t = np.array([1, 5, 6, 8])

Let's display the vector.

In [0]:
print(t)

[1 5 6 8]


Let's define the matrix and display it.



In [0]:
X = np.array([
    [1, 2, 3],
    [1, 2, 5],
    [1, 3, 4],
    [1, 5, 9]
])

In [0]:
print(X)

[[1 2 3]
 [1 2 5]
 [1 3 4]
 [1 5 9]]


Here the `np.array` function converts a Python list into NumPy's multidimensional array type (`np.ndarray`).

Next, let's transpose `X`. For an array defined as an `np.ndarray`, appending `.T` is all that is needed to transpose it.


In [0]:
print(X.T)

[[1 1 1 1]
 [2 2 3 5]
 [3 5 4 9]]


You can see that the rows and columns have been swapped.

Matrix multiplication can be performed with `np.dot` as follows. When multiplying matrices, make sure that the number of columns of the first matrix equals the number of rows of the second.


In [0]:
XX = np.dot(X.T, X)

In [0]:
print(XX)

[[  4  12  21]
 [ 12  42  73]
 [ 21  73 131]]


Next, compute the inverse $\left(\boldsymbol{X}^{T}\boldsymbol{X}\right)^{-1}$ of $\boldsymbol{X}^{T}\boldsymbol{X}$. Use `np.linalg.inv` to obtain the inverse matrix.

In [0]:
XX_inv = np.linalg.inv(XX)

In [0]:
print(XX_inv)

[[ 1.76530612 -0.39795918 -0.06122449]
 [-0.39795918  0.84693878 -0.40816327]
 [-0.06122449 -0.40816327  0.24489796]]
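One way to sanity-check the result, sketched here as an optional step, is to multiply the matrix by its computed inverse and compare the product with the identity matrix. Because of floating-point rounding, the comparison uses `np.allclose` rather than exact equality:

```python
import numpy as np

XX = np.array([
    [4, 12, 21],
    [12, 42, 73],
    [21, 73, 131]
])
XX_inv = np.linalg.inv(XX)

# The product should be the 3x3 identity matrix up to rounding error.
product = np.dot(XX, XX_inv)
print(np.allclose(product, np.eye(3)))
```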


You now have the necessary operations for multiple regression analysis.

The optimal parameter $\boldsymbol{w} = \left(\boldsymbol{X}^{T}\boldsymbol{X}\right)^{-1}\boldsymbol{X}^{T}\boldsymbol{t}$ is computed as follows.


In [0]:
Xt = np.dot(X.T, t)

In [0]:
print(Xt)

[ 20  70 124]


In [0]:
w = np.dot(XX_inv, Xt)

In [0]:
print(w)

[-0.14285714  0.71428571  0.57142857]


The parameter $\boldsymbol{w}$ has now been obtained. With NumPy, you can write mathematical expressions in a program almost exactly as they appear on paper.
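Forming the inverse explicitly mirrors the formula, but it is not the only option. As a sketch of an alternative (not the method used in this text), `np.linalg.solve` solves the linear system $\boldsymbol{X}^{T}\boldsymbol{X}\boldsymbol{w} = \boldsymbol{X}^{T}\boldsymbol{t}$ directly, and `np.linalg.lstsq` solves the least-squares problem from $\boldsymbol{X}$ and $\boldsymbol{t}$. Both avoid computing the inverse, which is generally preferable numerically:

```python
import numpy as np

X = np.array([[1, 2, 3], [1, 2, 5], [1, 3, 4], [1, 5, 9]])
t = np.array([1, 5, 6, 8])

# Solve (X^T X) w = X^T t without computing the inverse explicitly.
w_solve = np.linalg.solve(np.dot(X.T, X), np.dot(X.T, t))

# Solve the least-squares problem min ||Xw - t||^2 directly.
w_lstsq, residuals, rank, singular_values = np.linalg.lstsq(X, t, rcond=None)

print(w_solve)
print(w_lstsq)
```

Both calls should agree with the `w` obtained above up to rounding error.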


## Running machine learning algorithms with Scikit-learn


Multiple regression analysis was relatively easy to implement with NumPy, but many practical machine learning algorithms are complex and often difficult for beginners to write from scratch. For this reason, a machine learning framework called **Scikit-learn** is publicly available for Python, and it lets even beginners use a variety of machine learning algorithms easily.

Here, we introduce how to implement multiple regression analysis using Scikit-learn. The data set is the same $\boldsymbol{X}$ and $\boldsymbol{t}$ as before. In Scikit-learn, however, it is common to assume a form in which the parameter $\boldsymbol{w}$ does **not** include the bias $b$, so the column of $1$s is removed from the first column of $\boldsymbol{X}$. Therefore,

$$
\boldsymbol{X} = 
\begin{bmatrix}
2 & 3 \\
2 & 5 \\
3 & 4 \\  
5 & 9 
\end{bmatrix}, \
\boldsymbol{t} = 
\begin{bmatrix}
1 \\ 5 \\ 6 \\ 8
\end{bmatrix}
$$

is the data set used here.



### Scikit-learn basics
Scikit-learn can be imported under the name `sklearn`:

In [0]:
import sklearn

To use multiple regression analysis, import `LinearRegression` as follows.

In [0]:
from sklearn.linear_model import LinearRegression

In addition to the [official reference](http://scikit-learn.org/), it is also useful to look at actual code examples when investigating how to use Scikit-learn. For example, searching for keywords such as "multiple regression analysis Scikit-learn" in a search engine turns up many code examples.

The multiple regression algorithm is defined as a class, which must be instantiated before the model can be used. Instantiation is done by appending `()` to the class name.

In [0]:
model = LinearRegression()

With that, you are ready to use multiple regression analysis. Using this model, the parameters are learned as follows.

In [0]:
X = np.array([
    [2, 3],
    [2, 5],
    [3, 4],
    [5, 9]
])
t = np.array([1, 5, 6, 8])

model.fit(X, t)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

The result is evaluated as follows.

In [0]:
model.score(X, t)

0.6923076923076923

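For regression, `.score()` automatically computes an index called the **coefficient of determination**, which is given by the following equation. Here $t_{i}$ is the teacher data, $y_{i}$ is the predicted value, and $\bar{t}$ is the mean of the teacher data.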

$$
R^{2} = 1 - \dfrac{\sum_{i}\left( t_{i} - y_{i} \right)^{2}}{\sum_{i}\left( t_{i} - \bar{t} \right)^{2}}
$$
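To make the formula concrete, the sketch below evaluates it with NumPy for the four-sample data set, using the parameter values obtained with NumPy earlier in this section. The variable names `y`, `ss_res`, and `ss_tot` are illustrative:

```python
import numpy as np

X = np.array([[2, 3], [2, 5], [3, 4], [5, 9]])
t = np.array([1, 5, 6, 8])

# Parameters from the NumPy computation earlier in this section:
# [-0.14285714, 0.71428571, 0.57142857] = [-1/7, 5/7, 4/7]
b = -1 / 7
w = np.array([5, 4]) / 7

y = np.dot(X, w) + b                  # predictions y_i
ss_res = np.sum((t - y) ** 2)         # numerator: sum of squared residuals
ss_tot = np.sum((t - t.mean()) ** 2)  # denominator: total sum of squares
r2 = 1 - ss_res / ss_tot
print(r2)
```

The result, $1 - 8/26 \approx 0.692$, agrees with the value returned by `model.score(X, t)`.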

In this way, Scikit-learn offers a simple, uniform interface. Its strength is that, whichever algorithm you choose, you train it with `.fit()` and evaluate it with `.score()`.

Also, although the details differ by algorithm, the learned parameters are stored as instance attributes, so they can be inspected after training.

In [0]:
# Parameters w
model.coef_

array([0.71428571, 0.57142857])

In [0]:
# Bias b
model.intercept_

-0.14285714285714501
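These values can be compared with the closed-form solution computed earlier with NumPy. The following sketch assumes the same four-sample data set; `X_with_bias` and `w_numpy` are names introduced here for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[2, 3], [2, 5], [3, 4], [5, 9]])
t = np.array([1, 5, 6, 8])

# Scikit-learn: the bias is handled separately from X.
model = LinearRegression()
model.fit(X, t)

# NumPy: add a column of ones and solve the normal equations.
X_with_bias = np.hstack([np.ones((X.shape[0], 1)), X])
w_numpy = np.linalg.solve(
    np.dot(X_with_bias.T, X_with_bias),
    np.dot(X_with_bias.T, t),
)

# The first element of w_numpy corresponds to intercept_, the rest to coef_.
print(np.allclose(w_numpy[0], model.intercept_))
print(np.allclose(w_numpy[1:], model.coef_))
```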

### Scikit-learn applications
Scikit-learn has many features to help implement machine learning. This section introduces how to use sample data sets and how to divide them.


#### Using sample data sets
First, we introduce how to handle sample data sets. Scikit-learn provides several data sets. Here we use a data set of property prices by area in the suburbs of Boston, USA.

This data set contains $506$ samples. Each sample records the average property price of the target area together with information linked to it: average property information for the area (number of rooms per dwelling, building age, distance from employment facilities, etc.), demographic information (proportion of low-income earners, number of students per teacher, etc.), and information on the living environment (such as the crime rate). The goal is to build a model that predicts the average property price, the output variable, using property and demographic information as input variables. There are 13 input variables in total, described below.

* CRIM: Crime rate per capita
* ZN: Percentage of residential lots over 25,000 square feet
* INDUS: Percentage of area occupied by non-retail industry
* CHAS: Dummy variable for the Charles River (1: along the river, 0: otherwise)
* NOX: Concentration of nitrogen oxides
* RM: Average number of rooms per residence
* AGE: Percentage of properties built before 1940
* DIS: Weighted distance from five Boston employment facilities
* RAD: Index of accessibility to urban main roads
* TAX: Property tax rate per \$10,000
* PTRATIO: Number of students per teacher
* B: Index based on the proportion of Black residents
* LSTAT: Percentage of low-income people

Let's call the `load_boston()` function to load the data set. Note that `load_boston()` was removed in scikit-learn 1.2, so this cell requires an earlier version.

In [0]:
from sklearn.datasets import load_boston

In [0]:
boston = load_boston()

The variable `boston` holds the data in a dictionary-like form. By looking at its contents, we can find the entries corresponding to the input data and the output data. Here, `data` corresponds to the input and `target` to the output.

In [0]:
X = boston['data']
t = boston['target']

In [0]:
print(X)

[[6.3200e-03 1.8000e+01 2.3100e+00 ... 1.5300e+01 3.9690e+02 4.9800e+00]
 [2.7310e-02 0.0000e+00 7.0700e+00 ... 1.7800e+01 3.9690e+02 9.1400e+00]
 [2.7290e-02 0.0000e+00 7.0700e+00 ... 1.7800e+01 3.9283e+02 4.0300e+00]
 ...
 [6.0760e-02 0.0000e+00 1.1930e+01 ... 2.1000e+01 3.9690e+02 5.6400e+00]
 [1.0959e-01 0.0000e+00 1.1930e+01 ... 2.1000e+01 3.9345e+02 6.4800e+00]
 [4.7410e-02 0.0000e+00 1.1930e+01 ... 2.1000e+01 3.9690e+02 7.8800e+00]]


In [0]:
print(t)

[24.  21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 15.  18.9 21.7 20.4
 18.2 19.9 23.1 17.5 20.2 18.2 13.6 19.6 15.2 14.5 15.6 13.9 16.6 14.8
 18.4 21.  12.7 14.5 13.2 13.1 13.5 18.9 20.  21.  24.7 30.8 34.9 26.6
 25.3 24.7 21.2 19.3 20.  16.6 14.4 19.4 19.7 20.5 25.  23.4 18.9 35.4
 24.7 31.6 23.3 19.6 18.7 16.  22.2 25.  33.  23.5 19.4 22.  17.4 20.9
 24.2 21.7 22.8 23.4 24.1 21.4 20.  20.8 21.2 20.3 28.  23.9 24.8 22.9
 23.9 26.6 22.5 22.2 23.6 28.7 22.6 22.  22.9 25.  20.6 28.4 21.4 38.7
 43.8 33.2 27.5 26.5 18.6 19.3 20.1 19.5 19.5 20.4 19.8 19.4 21.7 22.8
 18.8 18.7 18.5 18.3 21.2 19.2 20.4 19.3 22.  20.3 20.5 17.3 18.8 21.4
 15.7 16.2 18.  14.3 19.2 19.6 23.  18.4 15.6 18.1 17.4 17.1 13.3 17.8
 14.  14.4 13.4 15.6 11.8 13.8 15.6 14.6 17.8 15.4 21.5 19.6 15.3 19.4
 17.  15.6 13.1 41.3 24.3 23.3 27.  50.  50.  50.  22.7 25.  50.  23.8
 23.8 22.3 17.4 19.1 23.1 23.6 22.6 29.4 23.2 24.6 29.9 37.2 39.8 36.2
 37.9 32.5 26.4 29.6 50.  32.  29.8 34.9 37.  30.5 36.4 31.1 29.1 50.
 33.3 3

The input data and teacher data are stored as NumPy arrays, so you can check the number of rows and columns with `.shape`.

In [0]:
X.shape

(506, 13)

In [0]:
t.shape

(506,)

The input array $X$ stores $506$ samples. Each sample is represented as a $13$-dimensional vector, whose elements correspond to the $13$ input variables. The teacher data $t$ stores, for each sample, the scalar value of the average property price as the output variable.

#### Splitting the data set
Next, we introduce how to split the data into **training data** and **test data**. If a model is evaluated on the same data that was used for training, its performance on that data may be high even though it cannot predict well on unseen data (drawn from the same distribution) that it did not encounter during training. This is called **overfitting** (overlearning). To detect it, machine learning sets aside test data for performance evaluation, separate from the training data. This procedure of splitting and verifying is called the **holdout method**.

Scikit-learn provides a function for splitting data into training and test sets.

In [0]:
from sklearn.model_selection import train_test_split

In [0]:
X_train, X_test, t_train, t_test = train_test_split(X, t, test_size=0.3, random_state=0)

In [0]:
X_train.shape

(354, 13)

In [0]:
X_test.shape

(152, 13)

The `test_size` argument of `train_test_split()` is the fraction of data used for testing; setting it to $0.3$ makes $30$% of the whole data set test data. `random_state` is the random seed, and fixing it makes the split reproducible. Random numbers are involved because the function does not simply take the first $70$% for training and the remainder for testing. Instead, $70$% of the samples are selected at random for training and the remaining $30$% are used for testing.
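The role of the random numbers can be illustrated with a small NumPy sketch. This is not Scikit-learn's actual implementation, and it will not reproduce the same split as `train_test_split`; the arrays `X_dummy` and `t_dummy` are placeholders chosen for this example:

```python
import numpy as np

n_samples, test_size = 506, 0.3
X_dummy = np.arange(n_samples * 2).reshape(n_samples, 2)  # placeholder inputs
t_dummy = np.arange(n_samples)                            # placeholder targets

# A fixed seed makes the shuffled order, and hence the split, reproducible.
rng = np.random.RandomState(0)
indices = rng.permutation(n_samples)

# Round the test size up, so 30% of 506 samples becomes 152 test samples.
n_test = int(np.ceil(test_size * n_samples))
test_idx, train_idx = indices[:n_test], indices[n_test:]

X_train, X_test = X_dummy[train_idx], X_dummy[test_idx]
t_train, t_test = t_dummy[train_idx], t_dummy[test_idx]
print(X_train.shape, X_test.shape)
```

Every sample ends up in exactly one of the two sets, and rerunning the code with the same seed yields the same split.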

Next, the model is trained on the training data.

In [0]:
model = LinearRegression()

In [0]:
model.fit(X_train, t_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

When evaluating the model, it is better to check both the training data and the test data.

In [0]:
# Training data
model.score(X_train, t_train)

0.7644563391821222

In [0]:
# Test data
model.score(X_test, t_test)

0.673528086534723

By examining not only test data but also training data, you can isolate problems when learning fails.

**Underfitting** is a condition in which the model cannot predict even the training data with good accuracy. If underfitting occurs, the current machine learning algorithm is probably not capturing the features of the data well, so we try to improve it by changing the algorithm or by devising a transformation that represents the features of the input data more appropriately. Conversely, in the case of **overfitting (overlearning)**, we know that the algorithm captures the features of the data to some extent, so we take measures to keep the model from overfitting. A typical measure is to adjust the **hyperparameters**, the values that control how each algorithm learns its parameters. Thus, even when the desired results are not obtained, verifying both the training data and the test data is important, because the next step to take depends on which situation we are in.
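The difference between the two situations can be seen on synthetic data. The sketch below is an illustration only and is not part of the Boston example: it fits polynomials of a low and a high degree to noisy samples of a sine curve, treating the degree as a hyperparameter, and reports the coefficient of determination on training and test data. The data, the degrees, and the helper `r2_score_np` are all assumptions made for this example:

```python
import numpy as np

def r2_score_np(t, y):
    """Coefficient of determination for targets t and predictions y."""
    return 1 - np.sum((t - y) ** 2) / np.sum((t - t.mean()) ** 2)

rng = np.random.RandomState(0)
x_train = np.sort(rng.uniform(0, 1, 20))
x_test = np.sort(rng.uniform(0, 1, 20))
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, 20)
t_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.2, 20)

scores = {}
for degree in (1, 9):
    # Least-squares polynomial fit; the degree acts as a hyperparameter.
    poly = np.polynomial.Polynomial.fit(x_train, t_train, degree)
    scores[degree] = (
        r2_score_np(t_train, poly(x_train)),  # training score
        r2_score_np(t_test, poly(x_test)),    # test score
    )
    print(degree, scores[degree])
```

A low training score points to underfitting, while a high training score combined with a much lower test score points to overfitting.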

You can also perform scaling with Scikit-learn. For example, standardization, which converts the data to mean 0 and standard deviation 1, is performed as follows.

In [0]:
from sklearn.preprocessing import StandardScaler

In [0]:
# Instantiate
scaler = StandardScaler()

Calculate mean and variance using training data.

In [0]:
# Compute the mean and variance
scaler.fit(X_train)

StandardScaler(copy=True, with_mean=True, with_std=True)

Scale training data and test data using the calculated mean and variance.


In [0]:
# Transform
X_train_s = scaler.transform(X_train)
X_test_s  = scaler.transform(X_test)

Note that the mean and variance of the training data are also used when scaling the test data. Since the test data is supposed to be unknown to the model, using the mean and variance of the combined training and test data would give the model information about the test data that it should not have. Therefore, scaling is performed using only the training data, which the model is allowed to see.

Because the mean and variance of the training data and the test data differ, note that the test data scaled with the training statistics does not necessarily have mean $0$ and variance $1$.
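What `fit` and `transform` compute can be sketched in NumPy. The example assumes randomly generated placeholder arrays in place of the Boston data and uses the population standard deviation (`np.std` with its default `ddof=0`); it is meant to illustrate the idea rather than reproduce every detail of `StandardScaler`:

```python
import numpy as np

rng = np.random.RandomState(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(70, 3))  # placeholder training data
X_test = rng.normal(loc=5.0, scale=2.0, size=(30, 3))   # placeholder test data

# "fit": statistics are computed from the training data only.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

# "transform": the same training statistics are applied to both sets.
X_train_s = (X_train - mean) / std
X_test_s = (X_test - mean) / std

print(X_train_s.mean(axis=0), X_train_s.std(axis=0))  # 0 and 1 up to rounding
print(X_test_s.mean(axis=0), X_test_s.std(axis=0))    # generally not exactly 0 and 1
```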

In [0]:
print(X_train_s)

[[-0.20416267 -0.49997924  1.54801583 ...  1.2272573   0.42454294
   3.10807269]
 [-0.38584317  0.34677427 -0.58974728 ...  0.05696346  0.40185312
  -0.66643035]
 [-0.33266283 -0.49997924  1.54801583 ...  1.2272573   0.39846135
   0.63936662]
 ...
 [-0.38147768 -0.49997924 -0.15303077 ... -0.30312696  0.39659002
  -0.30284441]
 [-0.3720831  -0.49997924 -0.59690657 ... -0.25811566  0.37588849
   0.89967717]
 [-0.38289844 -0.49997924 -1.00641779 ... -0.84326258  0.42454294
   0.31822262]]


In [0]:
print(X_test_s)

[[-0.39152624 -0.49997924 -1.12239824 ... -0.70822867  0.17086147
  -0.72160487]
 [ 0.70825498 -0.49997924  1.00534187 ...  0.77714428  0.0648977
  -0.41177872]
 [-0.38588517 -0.49997924  0.4025299  ... -0.93328518  0.38758427
  -0.27454978]
 ...
 [ 1.6177735  -0.49997924  1.00534187 ...  0.77714428  0.42454294
   2.59876943]
 [-0.34043865 -0.49997924 -0.1687812  ... -0.03305915  0.42454294
  -1.11772962]
 [-0.39601293 -0.49997924 -1.27417512 ...  0.10197476  0.39202867
  -1.02294263]]


In addition, Scikit-learn supports various machine learning algorithms such as logistic regression, support vector machines, and random forests.

For these as well, as with multiple regression analysis, you instantiate a model, train it by passing the training data to `.fit()`, and evaluate it with `.score()`.
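As an illustration of this shared interface, the sketch below trains three regressors on the small four-sample data set from earlier in this section. The choice of `Ridge` and `RandomForestRegressor`, and their arguments, is an assumption made for this example; with so little data the scores themselves carry little meaning:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge

X = np.array([[2, 3], [2, 5], [3, 4], [5, 9]])
t = np.array([1, 5, 6, 8])

models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "random_forest": RandomForestRegressor(n_estimators=10, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X, t)                   # training
    scores[name] = model.score(X, t)  # evaluation (coefficient of determination)
    print(name, scores[name])
```

Only the line that constructs each model differs; the training and evaluation calls are identical.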

For more detailed information, please refer to the [Scikit-learn](https://scikit-learn.org/) site and other explanatory sites.