# Build & Save Similarity Model
---

### Overview
* Load the **preprocessed** data from the **Preprocessed_repository**, compute the **similarity** between each pair of data files, and assemble the results into a single **similarity_model** that is returned and saved.

---
* The diagram below shows how the similarity between the stored preprocessed_data files is computed and how the similarity_model is built and saved.

<img src="https://raw.githubusercontent.com/jhyun0919/EnergyData_jhyun/master/docs/images/%EC%8A%A4%ED%81%AC%EB%A6%B0%EC%83%B7%202016-05-18%20%EC%98%A4%EC%A0%84%2010.26.43.jpg" alt="Drawing" style="width: 700px;"/>

---
* Import the modules needed to compute the similarities and save the model.

In [1]:
from utils import GlobalParameter
from utils import FileIO
from utils import Similarity
import os

---
* Next, set the path to the repository and check it.

In [2]:
repository4prepodessed_path = os.path.join(GlobalParameter.Repository_Path, 
                                           str(GlobalParameter.Time_Interval), 
                                           GlobalParameter.Fully_Preprocessed_Path)
repository4prepodessed_path

'/Users/JH/Documents/GitHub/EnergyData_jhyun/repository/60/refined_data_fully_refined'
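
The path above is three pieces joined together. A minimal sketch of that composition follows; the three values are stand-ins inferred from the printed path, not read from `utils.GlobalParameter`, so treat them as assumptions.

```python
import os

# Hypothetical stand-ins for the GlobalParameter attributes, inferred from
# the path printed above; the real values live in utils.GlobalParameter.
repository_path = '/Users/JH/Documents/GitHub/EnergyData_jhyun/repository'
time_interval = 60
fully_preprocessed_path = 'refined_data_fully_refined'

# The interval is converted to a string so it can act as a directory name.
path = os.path.join(repository_path, str(time_interval), fully_preprocessed_path)
print(path)
```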

---
* Build a list of the absolute paths (abs_path) of the preprocessed_data files under that path.

In [3]:
file_list = FileIO.Load.binary_file_list(repository4prepodessed_path)
file_list

['/Users/JH/Documents/GitHub/EnergyData_jhyun/repository/60/refined_data_fully_refined/VTT_GW1_HA10_VM_EP_KV_K.bin',
 '/Users/JH/Documents/GitHub/EnergyData_jhyun/repository/60/refined_data_fully_refined/VTT_GW1_HA10_VM_KV_K.bin',
 '/Users/JH/Documents/GitHub/EnergyData_jhyun/repository/60/refined_data_fully_refined/VTT_GW1_HA10_VM_KV_KAM.bin',
 '/Users/JH/Documents/GitHub/EnergyData_jhyun/repository/60/refined_data_fully_refined/VTT_GW1_HA11_VM_EP_KV_K.bin',
 '/Users/JH/Documents/GitHub/EnergyData_jhyun/repository/60/refined_data_fully_refined/VTT_GW1_HA11_VM_KV_K.bin',
 '/Users/JH/Documents/GitHub/EnergyData_jhyun/repository/60/refined_data_fully_refined/VTT_GW1_HA11_VM_KV_KAM.bin',
 '/Users/JH/Documents/GitHub/EnergyData_jhyun/repository/60/refined_data_fully_refined/VTT_GW2_HA4_VM_EP_KV_K.bin',
 '/Users/JH/Documents/GitHub/EnergyData_jhyun/repository/60/refined_data_fully_refined/VTT_GW2_HA4_VM_KV_K.bin',
 '/Users/JH/Documents/GitHub/EnergyData_jhyun/repository/60/refined_data_full
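
How `FileIO.Load.binary_file_list` works internally is not shown here. As a rough sketch, a listing like the one above could be produced with `glob`; the helper name `list_binary_files` and the demo file names are hypothetical.

```python
import glob
import os
import tempfile


def list_binary_files(directory):
    """Return the sorted absolute paths of the .bin files directly under directory."""
    pattern = os.path.join(os.path.abspath(directory), '*.bin')
    return sorted(glob.glob(pattern))


# Demonstration on a throwaway directory with made-up file names.
demo_dir = tempfile.mkdtemp()
for name in ('b.bin', 'a.bin', 'notes.txt'):
    open(os.path.join(demo_dir, name), 'wb').close()

demo_files = list_binary_files(demo_dir)
print([os.path.basename(p) for p in demo_files])
```

Sorting keeps the order stable between runs, which matters because the list index is later reused as the matrix index.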

---
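* Build the **similarity_model** from the preprocessed files (the call below takes the time interval and the preprocessed directory name rather than file_list itself), and receive
    * the model (similarity_model) and
    * the path where it was saved (model_save_path)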

In [5]:
similarity_model, model_save_path = Similarity.Build.similarity_model(GlobalParameter.Time_Interval, 
                                                                      GlobalParameter.Fully_Preprocessed_Path)

calculating covariance
	run_time: 2.67620491982 sec
calculating cosine_similarity
	run_time: 3.37587618828 sec
calculating euclidean_distance
	run_time: 3.22806119919 sec
calculating manhattan_distance
	run_time: 4.99802017212 sec
calculating gradient_similarity
	run_time: 3.63395404816 sec
calculating reversed_gradient_similarity
	run_time: 3.4038131237 sec
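
The log above reports one run time per metric. A small wrapper along these lines could produce that kind of output; `timed` is a hypothetical helper written for illustration, not part of the `Similarity` module.

```python
import time


def timed(label, func, *args):
    """Run func(*args), print a log entry in the style shown above, return the result."""
    print('calculating ' + label)
    start = time.time()
    result = func(*args)
    print('\trun_time: {} sec'.format(time.time() - start))
    return result


# Any callable can be wrapped; a cheap computation stands in for a metric here.
total = timed('sum_of_squares', lambda n: sum(i * i for i in range(n)), 1000)
```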


---
* Check the returned model_save_path.

In [6]:
model_save_path

'/Users/JH/Documents/GitHub/EnergyData_jhyun/repository/60/refined_data_fully_refined/similarity_model/similarity.bin'
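
The on-disk format of `similarity.bin` is not documented in this notebook. Assuming it is a pickled dictionary (an assumption, not a confirmed detail), saving and reloading a model of this shape might look like the sketch below, which uses a tiny stand-in model and a temporary directory.

```python
import os
import pickle
import tempfile

import numpy as np

# A tiny stand-in model; the keys mirror the ones used in this notebook.
model = {
    'file_list': np.array(['a.bin', 'b.bin']),
    'euclidean_distance': np.array([[0.0, 1.5], [1.5, 0.0]]),
}

# Mirror the similarity_model/similarity.bin layout inside a temp directory.
save_dir = os.path.join(tempfile.mkdtemp(), 'similarity_model')
os.makedirs(save_dir)
save_path = os.path.join(save_dir, 'similarity.bin')

with open(save_path, 'wb') as f:
    pickle.dump(model, f)

with open(save_path, 'rb') as f:
    restored = pickle.load(f)
```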

---
* Check the returned similarity_model.

In [7]:
similarity_model

{'cosine_similarity': array([[ 0.        ,  0.04395962,  1.91224366,  0.07570824,  0.08059264,
          1.57521987,  0.31992838,  0.08303485,  2.44220136],
        [ 0.04395962,  0.        ,  1.91957027,  0.11966787,  0.09768805,
          1.58987308,  0.3590036 ,  0.06838164,  2.44220136],
        [ 1.91224366,  1.91957027,  0.        ,  1.93910788,  1.93910788,
          0.363888  ,  2.00748951,  1.95620329,  2.44220136],
        [ 0.07570824,  0.11966787,  1.93910788,  0.        ,  0.0048844 ,
          1.56300887,  0.14897428,  0.05617063,  2.44220136],
        [ 0.08059264,  0.09768805,  1.93910788,  0.0048844 ,  0.        ,
          1.56300887,  0.15385869,  0.04884403,  2.44220136],
        [ 1.57521987,  1.58987308,  0.363888  ,  1.56300887,  1.56300887,
          0.        ,  1.65337032,  1.58743088,  2.44220136],
        [ 0.31992838,  0.3590036 ,  2.00748951,  0.14897428,  0.15385869,
          1.65337032,  0.        ,  0.20026051,  2.44220136],
        [ 0.08303485,  0.06

---
### Similarity Model  

* Components of the **similarity_model**
    * file_list
    * covariance
    * cosine_similarity
    * euclidean_distance
    * manhattan_distance
    * gradient_similarity
    * reversed_gradient_similarity

---
* **file_list**
    * Holds the abs_path of each data file under the preprocessed_repository as a list
        * Each file's index in this list matches its row and column index in every similarity matrix

In [8]:
similarity_model['file_list']

array([ '/Users/JH/Documents/GitHub/EnergyData_jhyun/repository/60/refined_data_fully_refined/VTT_GW1_HA10_VM_EP_KV_K.bin',
       '/Users/JH/Documents/GitHub/EnergyData_jhyun/repository/60/refined_data_fully_refined/VTT_GW1_HA10_VM_KV_K.bin',
       '/Users/JH/Documents/GitHub/EnergyData_jhyun/repository/60/refined_data_fully_refined/VTT_GW1_HA10_VM_KV_KAM.bin',
       '/Users/JH/Documents/GitHub/EnergyData_jhyun/repository/60/refined_data_fully_refined/VTT_GW1_HA11_VM_EP_KV_K.bin',
       '/Users/JH/Documents/GitHub/EnergyData_jhyun/repository/60/refined_data_fully_refined/VTT_GW1_HA11_VM_KV_K.bin',
       '/Users/JH/Documents/GitHub/EnergyData_jhyun/repository/60/refined_data_fully_refined/VTT_GW1_HA11_VM_KV_KAM.bin',
       '/Users/JH/Documents/GitHub/EnergyData_jhyun/repository/60/refined_data_fully_refined/VTT_GW2_HA4_VM_EP_KV_K.bin',
       '/Users/JH/Documents/GitHub/EnergyData_jhyun/repository/60/refined_data_fully_refined/VTT_GW2_HA4_VM_KV_K.bin',
       '/Users/JH/Documents/
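
Because the list index doubles as the matrix index, looking up the closest file for a given entry is a matter of reading one row. The sketch below uses made-up file names and distances, and `closest_file` is a hypothetical helper.

```python
import numpy as np

# Made-up stand-ins for similarity_model['file_list'] and one distance matrix.
file_list = np.array(['a.bin', 'b.bin', 'c.bin'])
distance = np.array([[0.0, 0.7, 2.6],
                     [0.7, 0.0, 3.0],
                     [2.6, 3.0, 0.0]])


def closest_file(query_idx, file_list, distance):
    """Return the file whose distance to file_list[query_idx] is smallest."""
    row = distance[query_idx].astype(float).copy()
    row[query_idx] = np.inf  # exclude the file itself
    return file_list[int(np.argmin(row))]


nearest = closest_file(2, file_list, distance)
print(nearest)
```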

---
* **covariance**
    * Computes the **covariance** between each pair of data files and stores the result as a **symmetric matrix**

In [9]:
similarity_model['covariance']

array([[ 0.10783927,  1.83853107,  3.1777161 ,  1.81450021,  1.83338054,
         3.20926082,  2.04357923,  1.866151  ,  3.31663134],
       [ 1.83853107,  0.14463336,  3.20216202,  1.85875227,  1.84971161,
         3.28666996,  2.03977447,  1.852211  ,  3.31834099],
       [ 3.1777161 ,  3.20216202,  0.31389509,  3.18443644,  3.1900068 ,
         2.14644844,  3.20690894,  3.21880246,  3.31081514],
       [ 1.81450021,  1.85875227,  3.18443644,  0.        ,  1.66117722,
         3.1484395 ,  1.81826003,  1.74386952,  3.3159623 ],
       [ 1.83338054,  1.84971161,  3.1900068 ,  1.66117722,  0.00332314,
         3.16136215,  1.81229214,  1.74488395,  3.31629988],
       [ 3.20926082,  3.28666996,  2.14644844,  3.1484395 ,  3.16136215,
         0.28064634,  3.14032727,  3.22022906,  3.31184543],
       [ 2.04357923,  2.03977447,  3.20690894,  1.81826003,  1.81229214,
         3.14032727,  0.06157058,  1.85255153,  3.31403654],
       [ 1.866151  ,  1.852211  ,  3.21880246,  1.74386952,  1
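
For reference, a plain pairwise covariance matrix can be obtained with `numpy.cov`, where each row is one series. This is only the textbook quantity on made-up data; the values printed above appear to have been transformed further by the project, and that step is not reproduced here.

```python
import numpy as np

# Three made-up series of equal length, one per row.
series = np.array([[1.0, 2.0, 3.0, 4.0],
                   [2.0, 4.0, 6.0, 8.0],
                   [4.0, 3.0, 2.0, 1.0]])

# np.cov treats each row as a variable and returns a symmetric matrix.
cov = np.cov(series)
print(cov)
```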

---
* **cosine_similarity**
    * Computes the **cosine similarity** between each pair of data files and stores the result as a **symmetric matrix**

In [10]:
similarity_model['cosine_similarity']

array([[ 0.        ,  0.04395962,  1.91224366,  0.07570824,  0.08059264,
         1.57521987,  0.31992838,  0.08303485,  2.44220136],
       [ 0.04395962,  0.        ,  1.91957027,  0.11966787,  0.09768805,
         1.58987308,  0.3590036 ,  0.06838164,  2.44220136],
       [ 1.91224366,  1.91957027,  0.        ,  1.93910788,  1.93910788,
         0.363888  ,  2.00748951,  1.95620329,  2.44220136],
       [ 0.07570824,  0.11966787,  1.93910788,  0.        ,  0.0048844 ,
         1.56300887,  0.14897428,  0.05617063,  2.44220136],
       [ 0.08059264,  0.09768805,  1.93910788,  0.0048844 ,  0.        ,
         1.56300887,  0.15385869,  0.04884403,  2.44220136],
       [ 1.57521987,  1.58987308,  0.363888  ,  1.56300887,  1.56300887,
         0.        ,  1.65337032,  1.58743088,  2.44220136],
       [ 0.31992838,  0.3590036 ,  2.00748951,  0.14897428,  0.15385869,
         1.65337032,  0.        ,  0.20026051,  2.44220136],
       [ 0.08303485,  0.06838164,  1.95620329,  0.05617063,  0
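
The zero diagonal suggests the matrix stores a distance-like quantity rather than the raw cosine. One common form is the cosine distance, 1 minus the cosine of the angle between two series, sketched below on made-up data. The printed values exceed 2, so the project presumably rescales its result; that scaling is unknown and not reproduced here.

```python
import numpy as np


def cosine_distance_matrix(series):
    """Pairwise cosine distance (1 - cosine similarity) between the rows of series.

    Assumes no row is all zeros, since such a row has no direction.
    """
    norms = np.linalg.norm(series, axis=1, keepdims=True)
    unit = series / norms
    return 1.0 - np.dot(unit, unit.T)


series = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [2.0, 0.0]])
cos_dist = cosine_distance_matrix(series)
print(cos_dist)
```

Rows 0 and 2 point in the same direction, so their distance is 0 even though their magnitudes differ.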

---
* **euclidean_distance**
    * Computes the **Euclidean distance** between each pair of data files and stores the result as a **symmetric matrix**

In [11]:
similarity_model['euclidean_distance']

array([[ 0.        ,  0.65945954,  2.67370812,  0.64226099,  0.65135315,
         2.49971629,  1.29192096,  0.67627539,  2.636145  ],
       [ 0.65945954,  0.        ,  3.02290948,  1.00924077,  0.91827283,
         2.84002265,  1.63830422,  0.74482606,  3.02786146],
       [ 2.67370812,  3.02290948,  0.        ,  2.54651073,  2.60538939,
         0.85975368,  2.26964071,  2.75267302,  1.61926172],
       [ 0.64226099,  1.00924077,  2.54651073,  0.        ,  0.15557558,
         2.35659603,  0.86055388,  0.5823154 ,  2.46934339],
       [ 0.65135315,  0.91827283,  2.60538939,  0.15557558,  0.        ,
         2.41120695,  0.91380793,  0.53279562,  2.53829996],
       [ 2.49971629,  2.84002265,  0.85975368,  2.35659603,  2.41120695,
         0.        ,  2.12262381,  2.55862974,  1.73234539],
       [ 1.29192096,  1.63830422,  2.26964071,  0.86055388,  0.91380793,
         2.12262381,  0.        ,  1.12502657,  2.08494472],
       [ 0.67627539,  0.74482606,  2.75267302,  0.5823154 ,  0
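
A pairwise Euclidean distance matrix can be sketched with NumPy broadcasting, as below. The data is made up, and any normalization the project applies before or after this step is not shown in the notebook and is left out.

```python
import numpy as np


def euclidean_distance_matrix(series):
    """Pairwise Euclidean distance between the rows of series."""
    # diff[i, j] holds series[i] - series[j] for every pair of rows.
    diff = series[:, None, :] - series[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


series = np.array([[0.0, 0.0],
                   [3.0, 4.0],
                   [6.0, 8.0]])
euc = euclidean_distance_matrix(series)
print(euc)
```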

---
* **manhattan_distance**
    * Computes the **Manhattan distance** between each pair of data files and stores the result as a **symmetric matrix**

In [12]:
similarity_model['manhattan_distance']

array([[ 0.        ,  0.70143691,  2.57887866,  0.71371292,  0.76266971,
         2.21805604,  1.28795663,  0.80851937,  2.8401914 ],
       [ 0.70143691,  0.        ,  3.27525046,  1.24669069,  1.15013752,
         2.9014812 ,  1.88317952,  0.90162289,  3.54162331],
       [ 2.57887866,  3.27525046,  0.        ,  2.2435543 ,  2.37809049,
         0.75545609,  1.60120584,  2.68895139,  0.4381868 ],
       [ 0.71371292,  1.24669069,  2.2435543 ,  0.        ,  0.1363811 ,
         1.97278256,  0.88232689,  0.60942044,  2.45983416],
       [ 0.76266971,  1.15013752,  2.37809049,  0.1363811 ,  0.        ,
         2.1011579 ,  1.00381529,  0.56044782,  2.59621466],
       [ 2.21805604,  2.9014812 ,  0.75545609,  1.97278256,  2.1011579 ,
         0.        ,  1.53719211,  2.44169554,  0.91600122],
       [ 1.28795663,  1.88317952,  1.60120584,  0.88232689,  1.00381529,
         1.53719211,  0.        ,  1.23692154,  1.67320237],
       [ 0.80851937,  0.90162289,  2.68895139,  0.60942044,  0
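
The Manhattan distance sums absolute differences instead of squaring them. A minimal sketch on made-up data follows; as with the other metrics, the project's own scaling is not reproduced.

```python
import numpy as np


def manhattan_distance_matrix(series):
    """Pairwise Manhattan (L1) distance between the rows of series."""
    diff = series[:, None, :] - series[None, :, :]
    return np.abs(diff).sum(axis=-1)


series = np.array([[0.0, 0.0],
                   [3.0, 4.0],
                   [6.0, 8.0]])
man = manhattan_distance_matrix(series)
print(man)
```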

---
* **gradient_similarity**
    * Computes the **gradient similarity** between each pair of data files and stores the result as a **symmetric matrix**

In [13]:
similarity_model['gradient_similarity']

array([[ 0.        ,  0.15347773,  1.69460376,  0.10623185,  0.16536898,
         2.35744011,  0.12511005,  0.18928584,  1.243828  ],
       [ 0.15347773,  0.        ,  1.80352289,  0.17097869,  0.14459289,
         2.44667487,  0.16819063,  0.16780433,  1.25207462],
       [ 1.69460376,  1.80352289,  0.        ,  1.72747265,  1.77260229,
         1.85258269,  1.69572906,  1.81208862,  2.79662694],
       [ 0.10623185,  0.17097869,  1.72747265,  0.        ,  0.1128493 ,
         2.3352364 ,  0.09929529,  0.17361559,  1.20855735],
       [ 0.16536898,  0.14459289,  1.77260229,  0.1128493 ,  0.        ,
         2.40801151,  0.13372616,  0.15413276,  1.21112707],
       [ 2.35744011,  2.44667487,  1.85258269,  2.3352364 ,  2.40801151,
         0.        ,  2.33965363,  2.45374579,  3.43986289],
       [ 0.12511005,  0.16819063,  1.69572906,  0.09929529,  0.13372616,
         2.33965363,  0.        ,  0.16545296,  1.16779455],
       [ 0.18928584,  0.16780433,  1.81208862,  0.17361559,  0
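
The notebook does not define gradient similarity. One plausible reading, offered purely as an assumption, is to compare how the series change over time: take first differences of each series and measure the Euclidean distance between those difference vectors.

```python
import numpy as np


def gradient_distance_matrix(series):
    """Pairwise Euclidean distance between the first differences of each row.

    This is an assumed definition for illustration, not the project's formula.
    """
    grads = np.diff(series, axis=1)
    diff = grads[:, None, :] - grads[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


series = np.array([[0.0, 1.0, 2.0, 3.0],      # rising
                   [10.0, 11.0, 12.0, 13.0],  # rising, at a higher level
                   [3.0, 2.0, 1.0, 0.0]])     # falling
grad = gradient_distance_matrix(series)
print(grad)
```

Under this reading, two series with the same shape but different levels (rows 0 and 1) come out as identical.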

---
* **reversed_gradient_similarity**
    * Computes the **reversed gradient similarity** between each pair of data files and stores the result as a **symmetric matrix**

In [14]:
similarity_model['reversed_gradient_similarity']

array([[ 0.2311555 ,  0.23848662,  1.5675068 ,  0.19959264,  0.20186108,
         2.14289426,  0.16274176,  0.23842653,  1.1125119 ],
       [ 0.23848662,  0.24581773,  1.60976588,  0.20692376,  0.2091922 ,
         2.18211874,  0.17017804,  0.24575764,  1.11991813],
       [ 1.5675068 ,  1.60976588,  3.01044472,  1.54908888,  1.57866873,
         3.33870747,  1.51572328,  1.61418258,  2.50193868],
       [ 0.19959264,  0.20692376,  1.54908888,  0.16802978,  0.17029822,
         2.11041501,  0.13152443,  0.20686367,  1.08097908],
       [ 0.20186108,  0.2091922 ,  1.57866873,  0.17029822,  0.17256666,
         2.14594388,  0.13380789,  0.20913211,  1.08327757],
       [ 2.14289426,  2.18211874,  3.33870747,  2.11041501,  2.14594388,
         4.16470492,  2.09205718,  2.19093711,  3.07743129],
       [ 0.16274176,  0.17017804,  1.51572328,  0.13152443,  0.13380789,
         2.09205718,  0.09504912,  0.17031325,  1.04453382],
       [ 0.23842653,  0.24575764,  1.61418258,  0.20686367,  0
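
The definition of the reversed gradient similarity is also not given. One guess that is consistent with the nonzero diagonal above is to compare the first differences of one series with the sign-flipped first differences of the other, so that series moving in opposite directions score as close. The sketch below implements that guess on made-up data; it is an assumption, not the project's formula.

```python
import numpy as np


def reversed_gradient_distance_matrix(series):
    """Pairwise distance between each row's first differences and the
    negated first differences of every other row (an assumed definition)."""
    grads = np.diff(series, axis=1)
    # grads[i] - (-grads[j]) == grads[i] + grads[j], which is symmetric in i and j.
    total = grads[:, None, :] + grads[None, :, :]
    return np.sqrt((total ** 2).sum(axis=-1))


series = np.array([[0.0, 1.0, 2.0, 3.0],   # rising
                   [3.0, 2.0, 1.0, 0.0]])  # falling mirror image
rev = reversed_gradient_distance_matrix(series)
print(rev)
```

Here the mirrored pair gets distance 0, while each series compared with itself gets a nonzero value, which is why the diagonal is not zero under this reading.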

In [15]:
# Build and save the model for the semi-preprocessed data at GlobalParameter.Time_Interval
_, _ = Similarity.Build.similarity_model(GlobalParameter.Time_Interval, GlobalParameter.Semi_Preprocessed_Path)

calculating covariance
	run_time: 3.01052808762 sec
calculating cosine_similarity
	run_time: 3.64429688454 sec
calculating euclidean_distance
	run_time: 3.26734900475 sec
calculating manhattan_distance
	run_time: 3.04596590996 sec
calculating gradient_similarity
	run_time: 3.40316390991 sec
calculating reversed_gradient_similarity
	run_time: 3.44988083839 sec


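In [16]:
# Repeat for a time interval of 30, for both the fully and the semi-preprocessed data
_, _ = Similarity.Build.similarity_model(30, GlobalParameter.Fully_Preprocessed_Path)
print  # Python 2 print statement: emits a blank line between the two logs
_, _ = Similarity.Build.similarity_model(30, GlobalParameter.Semi_Preprocessed_Path)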

calculating covariance
	run_time: 6.15459012985 sec
calculating cosine_similarity
	run_time: 7.06285715103 sec
calculating euclidean_distance
	run_time: 6.92847514153 sec
calculating manhattan_distance
	run_time: 6.41056799889 sec
calculating gradient_similarity
	run_time: 6.98600482941 sec
calculating reversed_gradient_similarity
	run_time: 6.91096282005 sec

calculating covariance
	run_time: 6.09513902664 sec
calculating cosine_similarity
	run_time: 7.48777580261 sec
calculating euclidean_distance
	run_time: 6.83860993385 sec
calculating manhattan_distance
	run_time: 6.41062617302 sec
calculating gradient_similarity
	run_time: 6.46687412262 sec
calculating reversed_gradient_similarity
	run_time: 7.78916811943 sec


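In [17]:
# Repeat for a time interval of 15, for both the fully and the semi-preprocessed data
_, _ = Similarity.Build.similarity_model(15, GlobalParameter.Fully_Preprocessed_Path)
print  # Python 2 print statement: emits a blank line between the two logs
_, _ = Similarity.Build.similarity_model(15, GlobalParameter.Semi_Preprocessed_Path)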

calculating covariance
	run_time: 12.6768641472 sec
calculating cosine_similarity
	run_time: 17.85947299 sec
calculating euclidean_distance
	run_time: 16.0510501862 sec
calculating manhattan_distance
	run_time: 14.9975569248 sec
calculating gradient_similarity
	run_time: 14.0315999985 sec
calculating reversed_gradient_similarity
	run_time: 14.111702919 sec

calculating covariance
	run_time: 12.0890350342 sec
calculating cosine_similarity
	run_time: 14.7661991119 sec
calculating euclidean_distance
	run_time: 15.1407120228 sec
calculating manhattan_distance
	run_time: 13.8755540848 sec
calculating gradient_similarity
	run_time: 14.4772369862 sec
calculating reversed_gradient_similarity
	run_time: 13.9717969894 sec
