# Feature Extraction - MDS 

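Algorithm source: 단단한 머신러닝, Chapter 10 (Dimensionality Reduction and Metric Learning)

**Input**
- Distance matrix $D \in \mathbb{R}^{m \times m}$, whose element $dist_{ij}$ is the distance from sample $x_i$ to sample $x_j$
- Dimensionality of the low-dimensional space, $d'$

**Procedure**
1. Compute $dist_{i.}^2, dist_{.j}^2, dist_{..}^2$.

2. Compute $b_{ij}$ from the $dist$ values.

3. Perform an eigendecomposition of the matrix $B$.

4. Let $\hat \Lambda$ be the diagonal matrix formed from the $d'$ largest eigenvalues and $\hat V$ the matrix of the corresponding eigenvectors.

**Output**
- The matrix $\hat V \hat \Lambda^{\frac{1}{2}} \in \mathbb{R}^{m \times d'}$; each row is the low-dimensional coordinate of one sample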
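
The four steps above can be sketched as a single NumPy function. This is a minimal sketch, not the implementation built later in this notebook: the name `classical_mds` and its arguments are hypothetical, `D` is assumed to hold plain (not squared) Euclidean distances, and `np.linalg.eigh` is used because $B$ is symmetric.

```python
import numpy as np

def classical_mds(D, d):
    """Hypothetical helper: embed m samples in d dimensions from an (m, m) distance matrix D."""
    D_sq = np.asarray(D, dtype=float) ** 2
    # Step 1: row means, column means, and overall mean of the squared distances.
    row_mean = D_sq.mean(axis=1, keepdims=True)   # shape (m, 1)
    col_mean = D_sq.mean(axis=0, keepdims=True)   # shape (1, m)
    all_mean = D_sq.mean()
    # Step 2: inner-product matrix B obtained by double centering.
    B = -0.5 * (D_sq - row_mean - col_mean + all_mean)
    # Step 3: eigendecomposition of the symmetric matrix B (eigenvalues come back ascending).
    eigval, eigvec = np.linalg.eigh(B)
    # Step 4: keep the d largest eigenvalues and the matching eigenvectors (columns).
    top = np.argsort(eigval)[::-1][:d]
    lam = np.maximum(eigval[top], 0.0)            # clip small negatives caused by round-off
    return eigvec[:, top] * np.sqrt(lam)          # V * Lambda^(1/2), shape (m, d)

# Usage: the four corners of a unit square. A 2-D embedding should
# reproduce every pairwise distance (up to rotation and reflection).
points = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
D = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
Z = classical_mds(D, 2)
D_hat = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
print(np.allclose(D, D_hat))
```

The recovered coordinates generally differ from the original points by a rotation, reflection, or translation, since only the distances are given to the algorithm.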




In [1]:
# Import libraries and load the data

import numpy as np
import pandas as pd 

from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression 

boston = load_boston()
X = boston.data 

y = boston.target


    The Boston housing prices dataset has an ethical problem. You can refer to
    the documentation of this function for further details.

    The scikit-learn maintainers therefore strongly discourage the use of this
    dataset unless the purpose of the code is to study and educate about
    ethical issues in data science and machine learning.

    In this special case, you can fetch the dataset from the original
    source::

        import pandas as pd
        import numpy as np


        data_url = "http://lib.stat.cmu.edu/datasets/boston"
        raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
        data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
        target = raw_df.values[1::2, 2]

    Alternative datasets include the California housing dataset (i.e.
    :func:`~sklearn.datasets.fetch_california_housing`) and the Ames housing
    dataset. You can load the datasets as follows::

        from sklearn.datasets import fetch_california_h

### Setting up `__init__` and computing $dist_{i.}^2, dist_{.j}^2, dist_{..}^2$


**Input**
- Distance matrix $D \in \mathbb{R}^{m \times m}$, whose element $dist_{ij}$ is the distance from sample $x_i$ to sample $x_j$
- Dimensionality of the low-dimensional space, $d'$

**Procedure**
1. Compute $dist_{i.}^2, dist_{.j}^2, dist_{..}^2$.


**What to implement**
- $dist_{i.}^2 = \frac {1}{m} \sum_{j=1}^m dist_{ij}^2$
- $dist_{.j}^2 = \frac {1}{m} \sum_{i=1}^m dist_{ij}^2$
- $dist_{..}^2 = \frac {1}{m^2} \sum_{i=1}^m \sum_{j=1}^m dist_{ij}^2$

- $dist_{ij}^2 = ||z_i - z_j||^2$
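
As a small worked example of these averages, the sketch below uses three hypothetical points on a line, $z = (0, 1, 3)$, so that every value can be checked by hand. The variable names are illustrative only.

```python
import numpy as np

# Three hypothetical 1-D points: z_1 = 0, z_2 = 1, z_3 = 3.
z = np.array([[0.0], [1.0], [3.0]])
m = z.shape[0]

# Squared distances dist_ij^2 = ||z_i - z_j||^2, computed with broadcasting.
diff = z[:, None, :] - z[None, :, :]
dist_sq = (diff ** 2).sum(axis=-1)
# dist_sq is [[0, 1, 9], [1, 0, 4], [9, 4, 0]]

dist_i = dist_sq.sum(axis=1) / m     # row means: [10/3, 5/3, 13/3]
dist_j = dist_sq.sum(axis=0) / m     # column means (same as row means: dist_sq is symmetric)
dist_all = dist_sq.sum() / m ** 2    # overall mean: 28/9

print(dist_i, dist_j, dist_all)
```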


**What is needed**
- X
- the $dist_{ij}^2$ matrix

**Function signature**
- def dist(self) : => $dist_{ij}^2,  dist_{i.}^2,  dist_{.j}^2,  dist_{..}^2$


In [16]:
class MDS():
    def __init__(self, X):
        self.X = np.asarray(X, dtype=float)
        self.n = np.shape(X)[0]  # number of samples (m in the formulas)
        self.m = np.shape(X)[1]  # number of features

    def dist(self):
        # Squared Euclidean distances, using
        # ||x_i - x_j||^2 = x_i.x_i + x_j.x_j - 2 * x_i.x_j
        gram = self.X @ self.X.T
        sq_norm = np.diag(gram)
        dist_matrix = sq_norm[:, None] + sq_norm[None, :] - 2 * gram
        dist_matrix = np.maximum(dist_matrix, 0)  # clip round-off negatives
        dist_i = dist_matrix.sum(axis=1) / self.n     # row means
        dist_j = dist_matrix.sum(axis=0) / self.n     # column means
        dist_all = dist_matrix.sum() / (self.n ** 2)  # overall mean

        return dist_matrix, dist_i, dist_j, dist_all

In [17]:
test = MDS(X)
np.shape(test.dist()[0])
np.shape(test.dist()[3])


()

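### 2. Compute $b_{ij}$ from the $dist$ values
### 3. Perform an eigendecomposition of the matrix $B$
### 4. Let $\hat \Lambda$ be the diagonal matrix of the $d'$ largest eigenvalues and $\hat V$ the matrix of the corresponding eigenvectors




**What to implement**
- $b_{ij} = -\frac{1}{2}(dist_{ij}^2 - dist_{i.}^2 - dist_{.j}^2 + dist_{..}^2)$
- Eigendecomposition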
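
The formula for $b_{ij}$ is a double centering of the squared-distance matrix. As a sanity check, assuming Euclidean distances, the sketch below compares the element-wise formula with the matrix form $-\frac{1}{2} J D^{(2)} J$, where $J = I - \frac{1}{m}\mathbf{1}\mathbf{1}^T$, and with the inner products of the mean-centered samples. The random data and the variable names are illustrative and do not come from this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.normal(size=(6, 3))   # 6 hypothetical samples in 3 dimensions
m = Z.shape[0]

# Squared Euclidean distances between all pairs of samples.
dist_sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)

# Element-wise formula: b_ij = -(dist_ij^2 - dist_i.^2 - dist_.j^2 + dist_..^2) / 2
row = dist_sq.mean(axis=1)
col = dist_sq.mean(axis=0)
B_elementwise = -(dist_sq - row[:, None] - col[None, :] + dist_sq.mean()) / 2

# Matrix form with the centering matrix J = I - (1/m) * ones.
J = np.eye(m) - np.ones((m, m)) / m
B_matrix = -0.5 * (J @ dist_sq @ J)

# Inner products of the mean-centered samples.
Zc = Z - Z.mean(axis=0)
B_gram = Zc @ Zc.T

print(np.allclose(B_elementwise, B_matrix), np.allclose(B_elementwise, B_gram))
```

Note the explicit `[:, None]` and `[None, :]`: with 1-D arrays of means, `.T` has no effect, so the row means must be reshaped into a column by hand for the broadcast to subtract them along the correct axis.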


**What is needed**
- $dist_{ij}^2,  dist_{i.}^2, dist_{.j}^2 , dist_{..}^2$
- the $B$ matrix

**Function signatures**
- def b(self) : => $B$ matrix

- def eigen(self) : => eigenvalues (sorted in descending order), eigenvector matrix

- def mds_goal(self, d) : => $\hat V \hat \Lambda^{\frac{1}{2}} \in \mathbb{R}^{m \times d'}$
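
Why the square root of the eigenvalues appears in the output can be written out in one line. The derivation assumes that $B$ is positive semidefinite, which holds when $D$ contains Euclidean distances, and writes $Z \in \mathbb{R}^{m \times d'}$ for the low-dimensional coordinates with one row per sample, so that $b_{ij} = z_i^T z_j$:

$$
B = Z Z^T \approx \hat V \hat \Lambda \hat V^T
  = \left(\hat V \hat \Lambda^{\frac{1}{2}}\right)\left(\hat V \hat \Lambda^{\frac{1}{2}}\right)^T
  \quad\Longrightarrow\quad
  Z = \hat V \hat \Lambda^{\frac{1}{2}} \in \mathbb{R}^{m \times d'}
$$

The approximation becomes an equality when $d'$ equals the number of non-zero eigenvalues of $B$. The transposed form $\hat \Lambda^{\frac{1}{2}} \hat V^T \in \mathbb{R}^{d' \times m}$ holds the same coordinates with one column per sample.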


In [57]:
class MDS():
    def __init__(self, X):
        self.X = np.asarray(X, dtype=float)
        self.n = np.shape(X)[0]  # number of samples (m in the formulas)
        self.m = np.shape(X)[1]  # number of features

    def dist(self):
        # Squared Euclidean distances, using
        # ||x_i - x_j||^2 = x_i.x_i + x_j.x_j - 2 * x_i.x_j
        gram = self.X @ self.X.T
        sq_norm = np.diag(gram)
        dist_matrix = sq_norm[:, None] + sq_norm[None, :] - 2 * gram
        dist_matrix = np.maximum(dist_matrix, 0)  # clip round-off negatives
        dist_i = dist_matrix.sum(axis=1) / self.n     # row means
        dist_j = dist_matrix.sum(axis=0) / self.n     # column means
        dist_all = dist_matrix.sum() / (self.n ** 2)  # overall mean

        return dist_matrix, dist_i, dist_j, dist_all

    def b(self):
        dist_matrix, row, column, sum_all = self.dist()
        # row and column are 1-D, so reshape them explicitly: row means are
        # subtracted along the rows, column means along the columns.
        return -(dist_matrix - row[:, None] - column[None, :] + sum_all) / 2

    def eigen(self):
        # B is symmetric, so eigh applies. It returns the eigenvalues in
        # ascending order and the eigenvectors as columns.
        eigenvalue, eigenvector = np.linalg.eigh(self.b())
        index = np.argsort(eigenvalue)[::-1]  # descending order
        return eigenvalue[index], eigenvector[:, index]

    def mds_goal(self, d):
        eigenvalue, eigenvector = self.eigen()
        # Clip tiny negative eigenvalues caused by round-off before the square root.
        eigenvalue_d_sqrt = np.sqrt(np.maximum(eigenvalue[:d], 0))
        # Eigenvectors are stored as columns, so scaling column k by
        # sqrt(lambda_k) gives V * Lambda^(1/2) with shape (n, d).
        return eigenvector[:, :d] * eigenvalue_d_sqrt


In [56]:
test = MDS(X)
test.mds_goal(3)  # array of shape (n, 3): one row per sample