
Installing packages

In [None]:
%pip install -U typing-extensions

Collecting typing-extensions
  Downloading typing_extensions-4.8.0-py3-none-any.whl (31 kB)
Installing collected packages: typing-extensions
  Attempting uninstall: typing-extensions
    Found existing installation: typing_extensions 4.5.0
    Uninstalling typing_extensions-4.5.0:
      Successfully uninstalled typing_extensions-4.5.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow 2.13.0 requires typing-extensions<4.6.0,>=3.6.6, but you have typing-extensions 4.8.0 which is incompatible.
Successfully installed typing-extensions-4.8.0


In [None]:
%pip install jaxtyping

Collecting jaxtyping
  Downloading jaxtyping-0.2.22-py3-none-any.whl (25 kB)
Collecting typeguard>=2.13.3 (from jaxtyping)
  Downloading typeguard-4.1.5-py3-none-any.whl (34 kB)
Installing collected packages: typeguard, jaxtyping
Successfully installed jaxtyping-0.2.22 typeguard-4.1.5


Importing packages

In [None]:
from typing import Sequence, Any
from jaxtyping import Array, Float, Num
from typing_extensions import Self

import numpy as np
from sklearn.utils.validation import check_is_fitted
from sklearn.base import OneToOneFeatureMixin, TransformerMixin, BaseEstimator

from sklearn.datasets import load_iris, load_wine
from sklearn.model_selection import train_test_split

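## Normalization

Normalization is a preprocessing step for data analysis that applies the following transformation to each feature, mapping the values seen during fitting to the range [0, 1].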

$$
X_\text{normal} = \frac{X - X_\text{min}}{X_\text{max} - X_\text{min}}
$$
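
As a small worked illustration of the formula above, the following sketch applies it column by column with NumPy. The toy array `X` and its values are made up for this example and are not part of the original notebook.

```python
import numpy as np

# Toy data: 3 samples (rows) x 2 features (columns); values are hypothetical.
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [4.0, 40.0]])

X_min = X.min(axis=0)  # per-feature minimum: [1., 10.]
X_max = X.max(axis=0)  # per-feature maximum: [4., 40.]

# Broadcasting applies the formula to every row, feature by feature.
X_normal = (X - X_min) / (X_max - X_min)
print(X_normal)
```

After the transformation, each column has minimum 0 and maximum 1, regardless of its original scale.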

### Implementing normalization with a sklearn-like API

In [None]:
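class MinMaxScaler(OneToOneFeatureMixin, TransformerMixin, BaseEstimator):
    """Scale the selected features to the range [0, 1]."""

    def __init__(self, in_features=None):
        # Column indices to scale; None means every column.
        self.in_features = in_features

    def fit(self, X, y=None):
        if self.in_features is None:
            # Boolean mask that selects all columns.
            self.in_features_ = np.ones(X.shape[1], dtype=np.bool_)
        else:
            self.in_features_ = self.in_features
        X = X[:, self.in_features_]
        # Per-feature minimum and maximum learned from the data passed to fit.
        self.min_ = X.min(axis=0)
        self.max_ = X.max(axis=0)
        return self

    def transform(self, X):
        check_is_fitted(self, "min_")
        X = X[:, self.in_features_]
        # Note: a constant feature (max_ == min_) makes the denominator zero.
        transformed = (X - self.min_) / (self.max_ - self.min_)
        return transformed

    def fit_transform(self, X, y=None):
        # y is accepted and ignored so the class also works inside sklearn pipelines.
        self.fit(X, y)
        transformed = self.transform(X)
        return transformed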

Version with type hints

In [None]:
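class MinMaxScaler(OneToOneFeatureMixin, TransformerMixin, BaseEstimator):
    """Scale the selected features to the range [0, 1]."""

    def __init__(self, in_features: Sequence[int] | None = None) -> None:
        # Column indices to scale; None means every column.
        self.in_features = in_features

    def fit(self, X: Num[Array, "data feature"], y=None) -> Self:
        if self.in_features is None:
            # Boolean mask that selects all columns.
            self.in_features_ = np.ones(X.shape[1], dtype=np.bool_)
        else:
            self.in_features_ = self.in_features
        X = X[:, self.in_features_]
        # Per-feature minimum and maximum learned from the data passed to fit.
        self.min_ = X.min(axis=0)
        self.max_ = X.max(axis=0)
        return self

    # The output is always floating point (true division) and only contains
    # the selected columns, hence the different dimension name "selected".
    def transform(self, X: Num[Array, "data feature"]) -> Float[Array, "data selected"]:
        check_is_fitted(self, "min_")
        X = X[:, self.in_features_]
        transformed = (X - self.min_) / (self.max_ - self.min_)
        return transformed

    def fit_transform(self, X: Num[Array, "data feature"], y=None) -> Float[Array, "data selected"]:
        # y is accepted and ignored so the class also works inside sklearn pipelines.
        self.fit(X, y)
        transformed = self.transform(X)
        return transformed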

In [None]:
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3)

In [None]:
# fit has not been called yet, so check_is_fitted raises NotFittedError here.
MinMaxScaler().transform(X_test)

In [None]:
MinMaxScaler([0,1]).fit(X_train).transform(X_test)
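
The last call above fits on `X_train` and then transforms `X_test`, so the minimum and maximum come from the training split only. One consequence, sketched below with plain NumPy, is that transformed test values can fall outside [0, 1]. The single-feature arrays here are made up for illustration and reuse the names `X_train` and `X_test` only for readability.

```python
import numpy as np

# Hypothetical single-feature data, for illustration only.
X_train = np.array([[2.0], [4.0], [6.0]])
X_test = np.array([[1.0], [5.0], [8.0]])

# Statistics are computed from the training split only.
train_min = X_train.min(axis=0)  # 2.0
train_max = X_train.max(axis=0)  # 6.0

# Test values below train_min map below 0, and values above train_max map above 1.
X_test_scaled = (X_test - train_min) / (train_max - train_min)
print(X_test_scaled.ravel())  # values: -0.25, 0.75, 1.5
```

Reusing the training statistics keeps the test split from leaking into preprocessing, at the cost of an output range that is no longer guaranteed.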

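## Standardization

Standardization is a preprocessing step for data analysis that applies the following transformation to each feature: subtract the feature's mean and divide by its standard deviation.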


$$
X_\text{std} = \frac{X - \mu_X}{\sigma_X}
$$
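
A minimal NumPy sketch of the formula above, assuming the population standard deviation (NumPy's default, `ddof=0`). The toy array `X` is made up for this example.

```python
import numpy as np

# Toy data: 3 samples (rows) x 2 features (columns); values are hypothetical.
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 60.0]])

mu = X.mean(axis=0)    # per-feature mean
sigma = X.std(axis=0)  # per-feature standard deviation (ddof=0)

X_std = (X - mu) / sigma

# Each column now has mean 0 and standard deviation 1, up to rounding error.
print(X_std.mean(axis=0))
print(X_std.std(axis=0))
```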

### Implementing standardization with a sklearn-like API

Using the MinMaxScaler class as a reference, implement standardization.

In [None]:
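class StandardScaler(OneToOneFeatureMixin, TransformerMixin, BaseEstimator):
    # One possible solution. The ddof=0 default is an assumption (it matches np.std).
    def __init__(self, in_features: Sequence[int] | None = None, ddof: int = 0) -> None:
        self.in_features = in_features
        self.ddof = ddof  # delta degrees of freedom used for the standard deviation

    def fit(self, X: Num[Array, "data feature"], y=None) -> Self:
        if self.in_features is None:
            # Boolean mask that selects all columns.
            self.in_features_ = np.ones(X.shape[1], dtype=np.bool_)
        else:
            self.in_features_ = self.in_features
        X = X[:, self.in_features_]
        # Per-feature mean and standard deviation learned from the data passed to fit.
        self.mean_ = X.mean(axis=0)
        self.std_ = X.std(axis=0, ddof=self.ddof)
        return self

    def transform(self, X: Num[Array, "data feature"]) -> Float[Array, "data selected"]:
        check_is_fitted(self, "mean_")
        X = X[:, self.in_features_]
        # Note: a constant feature (std_ == 0) makes the denominator zero.
        transformed = (X - self.mean_) / self.std_
        return transformed

    def fit_transform(self, X: Num[Array, "data feature"], y=None) -> Float[Array, "data selected"]:
        # y is accepted and ignored so the class also works inside sklearn pipelines.
        self.fit(X, y)
        transformed = self.transform(X)
        return transformed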

In [None]:
StandardScaler(ddof=1).fit(X_train).transform(X_test)

In [None]:
StandardScaler(ddof=0).fit(X_train).transform(X_test)
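
The two calls above differ only in `ddof`. As a hedged illustration of what that parameter means in `np.std`: `ddof=0` divides the sum of squared deviations by n (population standard deviation), while `ddof=1` divides by n - 1 (sample standard deviation). The vector `x` below is made up for this example.

```python
import numpy as np

# Hypothetical 1-D sample: mean 5, sum of squared deviations 32.
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = x.size

std_pop = x.std(ddof=0)     # sqrt(32 / 8) = 2.0
std_sample = x.std(ddof=1)  # sqrt(32 / 7), slightly larger

# The two estimates differ by a factor of sqrt(n / (n - 1)).
print(std_pop, std_sample, std_pop * np.sqrt(n / (n - 1)))
```

Because the sample standard deviation is larger, standardized values computed with `ddof=1` are slightly smaller in magnitude. The gap shrinks as n grows.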

In [None]:
StandardScaler([0,1]).fit(X_train).transform(X_test)

In [None]:
# fit has not been called on this new instance, so this raises NotFittedError.
StandardScaler([0,1]).transform(X_test)