# Build & Save Similarity Model
---

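### Overview
* Loads the **preprocessed** data from the **Preprocessed_repository**, computes the **similarity** between every pair of data files, and assembles the results into a single **similarity model (similarity_model)**, which is returned and saved.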

---
* The figure below shows how the similarity between the stored preprocessed_data is computed and how the similarity_model is built and saved.

<img src="https://raw.githubusercontent.com/jhyun0919/EnergyData_jhyun/master/docs/images/%EC%8A%A4%ED%81%AC%EB%A6%B0%EC%83%B7%202016-05-18%20%EC%98%A4%EC%A0%84%2010.26.43.jpg" alt="Drawing" style="width: 700px;"/>

---
* Import the modules needed to compute and save the similarity.

In [1]:
from utils import GlobalParameter
from utils import FileIO
from utils import Similarity
import os

---
* Next, set the path of the repository and check it.

In [2]:
repository4prepodessed_path = os.path.join(GlobalParameter.RepositoryPath, 
                                           str(GlobalParameter.TimeInterval), 
                                           GlobalParameter.FullyPreprocessedPath)
repository4prepodessed_path

'/Users/JH/Documents/GitHub/EnergyData_jhyun/repository/1440/fully_refined'
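
The path is just the three components joined in order. A minimal sketch with placeholder values follows; the real values live in `GlobalParameter`, and the ones below are assumptions read off the printed path above.

```python
import os

# Placeholder values; the real ones are defined in utils.GlobalParameter.
repository_root = "repository"            # hypothetical repository root
time_interval = 1440                      # value seen in the printed path
fully_preprocessed_dir = "fully_refined"  # directory name seen in the printed path

# str() is needed because os.path.join only accepts strings.
example_path = os.path.join(repository_root, str(time_interval), fully_preprocessed_dir)
print(example_path)
```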

---
* Collect the absolute paths (abs_path) of the preprocessed_data files under that path into a list.

In [3]:
file_list = FileIO.Load.binary_file_list(repository4prepodessed_path)
file_list

['/Users/JH/Documents/GitHub/EnergyData_jhyun/repository/1440/fully_refined/VTT_GW1_HA10_VM_EP_KV_K.bin',
 '/Users/JH/Documents/GitHub/EnergyData_jhyun/repository/1440/fully_refined/VTT_GW1_HA10_VM_KV_K.bin',
 '/Users/JH/Documents/GitHub/EnergyData_jhyun/repository/1440/fully_refined/VTT_GW1_HA10_VM_KV_KAM.bin',
 '/Users/JH/Documents/GitHub/EnergyData_jhyun/repository/1440/fully_refined/VTT_GW1_HA11_VM_EP_KV_K.bin',
 '/Users/JH/Documents/GitHub/EnergyData_jhyun/repository/1440/fully_refined/VTT_GW1_HA11_VM_KV_K.bin',
 '/Users/JH/Documents/GitHub/EnergyData_jhyun/repository/1440/fully_refined/VTT_GW1_HA11_VM_KV_KAM.bin',
 '/Users/JH/Documents/GitHub/EnergyData_jhyun/repository/1440/fully_refined/VTT_GW2_HA4_VM_EP_KV_K.bin',
 '/Users/JH/Documents/GitHub/EnergyData_jhyun/repository/1440/fully_refined/VTT_GW2_HA4_VM_KV_K.bin',
 '/Users/JH/Documents/GitHub/EnergyData_jhyun/repository/1440/fully_refined/VTT_GW2_HA4_VM_KV_KAM.bin']
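
How `FileIO.Load.binary_file_list` is implemented is not shown here. As a rough sketch of the same idea, the hypothetical helper `list_binary_files` below returns the sorted absolute paths of all `.bin` files in a directory (written for Python 3).

```python
import os
import tempfile

def list_binary_files(directory, extension=".bin"):
    """Return sorted absolute paths of regular files ending with `extension`."""
    directory = os.path.abspath(directory)
    return sorted(
        os.path.join(directory, name)
        for name in os.listdir(directory)
        if name.endswith(extension)
        and os.path.isfile(os.path.join(directory, name))
    )

# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("b.bin", "a.bin", "notes.txt"):
        open(os.path.join(demo_dir, name), "w").close()
    demo_files = [os.path.basename(p) for p in list_binary_files(demo_dir)]

print(demo_files)  # ['a.bin', 'b.bin']
```

Sorting keeps the order stable between runs, which matters because the position of each file in the list is later used as a matrix index.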

---
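* Build the **similarity_model** from the preprocessed files. Note that the call below takes the time interval and the preprocessed-data directory name rather than file_list itself. It returns
    * the model (similarity_model) and
    * the path it was saved to (model_save_path)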

In [4]:
similarity_model, model_save_path = Similarity.Build.similarity_model(GlobalParameter.TimeInterval, 
                                                                      GlobalParameter.FullyPreprocessedPath)

calculating covariance
	run_time: 0.329840898514 sec
calculating cosine_similarity
	run_time: 0.157385110855 sec
calculating euclidean_distance
	run_time: 0.161708831787 sec
calculating manhattan_distance
	run_time: 0.14306306839 sec
calculating gradient_similarity
	run_time: 0.148790836334 sec
calculating reversed_gradient_similarity
	run_time: 0.152146100998 sec
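
The log above suggests that each metric is computed in turn and timed. The sketch below imitates that flow under stated assumptions: `pairwise_matrix`, `build_model`, and the toy series are all hypothetical, and the real `Similarity.Build.similarity_model` may work differently.

```python
import time

def pairwise_matrix(series_list, metric):
    """Build a symmetric matrix by applying `metric` to every pair of series."""
    n = len(series_list)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            value = metric(series_list[i], series_list[j])
            matrix[i][j] = value
            matrix[j][i] = value  # the metric is assumed to be symmetric
    return matrix

def manhattan(a, b):
    """Sum of absolute element-wise differences."""
    return float(sum(abs(x - y) for x, y in zip(a, b)))

def build_model(file_list, series_list, metrics):
    """Collect one matrix per metric in a dict, timing each computation."""
    model = {"file_list": list(file_list)}
    for name, metric in metrics.items():
        print("calculating " + name)
        start = time.time()
        model[name] = pairwise_matrix(series_list, metric)
        print("\trun_time: %s sec" % (time.time() - start))
    return model

toy_model = build_model(
    ["a.bin", "b.bin", "c.bin"],            # hypothetical file names
    [[1.0, 2.0], [2.0, 4.0], [0.0, 0.0]],   # toy series, not project data
    {"manhattan_distance": manhattan},
)
```

Only the upper triangle is computed and then mirrored, which halves the number of metric evaluations for symmetric measures.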


---
* Check the returned model_save_path.

In [5]:
model_save_path

'/Users/JH/Documents/GitHub/EnergyData_jhyun/repository/1440/fully_refined/similarity_model/similarity.bin'
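
The on-disk format of `similarity.bin` is not documented here. Assuming it is a pickled Python object (an assumption, not something this notebook confirms), saving and reloading a model could look roughly like this; `save_model` and `load_model` are hypothetical helpers, written for Python 3.

```python
import os
import pickle
import tempfile

def save_model(model, directory, file_name="similarity.bin"):
    """Pickle `model` into `directory` (created if missing) and return the path."""
    if not os.path.isdir(directory):
        os.makedirs(directory)
    path = os.path.join(directory, file_name)
    with open(path, "wb") as handle:
        pickle.dump(model, handle)
    return path

def load_model(path):
    """Read a pickled model back from `path`."""
    with open(path, "rb") as handle:
        return pickle.load(handle)

with tempfile.TemporaryDirectory() as demo_dir:
    demo_model = {
        "file_list": ["a.bin", "b.bin"],                    # hypothetical names
        "euclidean_distance": [[0.0, 1.5], [1.5, 0.0]],     # made-up values
    }
    demo_path = save_model(demo_model, os.path.join(demo_dir, "similarity_model"))
    restored = load_model(demo_path)
```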

---
* Check the returned similarity_model.

In [6]:
similarity_model

{'cosine_similarity': array([[ 0.        ,  0.04554733,  1.81936264,  0.07844262,  0.08350343,
          1.45751444,  0.33148332,  0.08350343,  2.52787661],
        [ 0.04554733,  0.        ,  1.81683224,  0.12145954,  0.10121628,
          1.46510566,  0.36943942,  0.0708514 ,  2.52787661],
        [ 1.81936264,  1.81683224,  0.        ,  1.84466671,  1.8396059 ,
          0.54403751,  1.93829177,  1.86237956,  2.52787661],
        [ 0.07844262,  0.12145954,  1.84466671,  0.        ,  0.00506081,
          1.42967996,  0.15435483,  0.05566895,  2.53040701],
        [ 0.08350343,  0.10121628,  1.8396059 ,  0.00506081,  0.        ,
          1.42461915,  0.15688523,  0.05060814,  2.53040701],
        [ 1.45751444,  1.46510566,  0.54403751,  1.42967996,  1.42461915,
          0.        ,  1.54101787,  1.45751444,  2.51522457],
        [ 0.33148332,  0.36943942,  1.93829177,  0.15435483,  0.15688523,
          1.54101787,  0.        ,  0.20749338,  2.53040701],
        [ 0.08350343,  0.07

---
### Similarity Model  

* Components of the **similarity_model**
    * file_list
    * covariance
    * cosine_similarity
    * euclidean_distance
    * manhattan_distance
    * gradient_similarity
    * reversed_gradient_similarity

---
* **file_list**
    * Holds the absolute paths (abs_path) of the data files under the preprocessed_repository as a list.
        * The list index (list_idx) of each file matches the row and column index used later in each similarity_matrix.
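
Because list positions and matrix indices coincide, a value for two files can be looked up by their paths. A minimal sketch, where `similarity_between` and the toy model are hypothetical:

```python
def similarity_between(model, metric_name, path_a, path_b):
    """Look up one matrix entry via the positions of two paths in file_list."""
    files = list(model["file_list"])
    row = files.index(path_a)
    column = files.index(path_b)
    return model[metric_name][row][column]

# Toy model with hypothetical file names and made-up values.
toy_model = {
    "file_list": ["a.bin", "b.bin", "c.bin"],
    "euclidean_distance": [
        [0.0, 0.7, 2.6],
        [0.7, 0.0, 3.0],
        [2.6, 3.0, 0.0],
    ],
}
looked_up = similarity_between(toy_model, "euclidean_distance", "a.bin", "c.bin")
print(looked_up)  # 2.6
```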

In [7]:
similarity_model['file_list']

array([ '/Users/JH/Documents/GitHub/EnergyData_jhyun/repository/1440/fully_refined/VTT_GW1_HA10_VM_EP_KV_K.bin',
       '/Users/JH/Documents/GitHub/EnergyData_jhyun/repository/1440/fully_refined/VTT_GW1_HA10_VM_KV_K.bin',
       '/Users/JH/Documents/GitHub/EnergyData_jhyun/repository/1440/fully_refined/VTT_GW1_HA10_VM_KV_KAM.bin',
       '/Users/JH/Documents/GitHub/EnergyData_jhyun/repository/1440/fully_refined/VTT_GW1_HA11_VM_EP_KV_K.bin',
       '/Users/JH/Documents/GitHub/EnergyData_jhyun/repository/1440/fully_refined/VTT_GW1_HA11_VM_KV_K.bin',
       '/Users/JH/Documents/GitHub/EnergyData_jhyun/repository/1440/fully_refined/VTT_GW1_HA11_VM_KV_KAM.bin',
       '/Users/JH/Documents/GitHub/EnergyData_jhyun/repository/1440/fully_refined/VTT_GW2_HA4_VM_EP_KV_K.bin',
       '/Users/JH/Documents/GitHub/EnergyData_jhyun/repository/1440/fully_refined/VTT_GW2_HA4_VM_KV_K.bin',
       '/Users/JH/Documents/GitHub/EnergyData_jhyun/repository/1440/fully_refined/VTT_GW2_HA4_VM_KV_KAM.bin'], 
    

---
* **covariance**
    * The **covariance** between each pair of data files is computed and the result is stored as a **symmetric matrix**.
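
For reference, the textbook sample covariance of two equal-length series can be sketched as follows. This shows only the standard definition; the stored matrix may be scaled or transformed by the `Similarity` module in ways not shown here, so the sketch is not expected to reproduce its numbers.

```python
def covariance(a, b):
    """Sample covariance of two equal-length series (divides by n - 1)."""
    n = len(a)
    mean_a = sum(a) / float(n)
    mean_b = sum(b) / float(n)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (n - 1)

cov_value = covariance([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(cov_value)  # 2.0
```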

In [8]:
similarity_model['covariance']

array([[ 0.10981464,  1.88461259,  3.28111533,  1.86655591,  1.88227246,
         3.30735225,  2.09996899,  1.92083406,  3.41030268],
       [ 1.88461259,  0.1399092 ,  3.29415553,  1.90822784,  1.89557636,
         3.37444243,  2.09304489,  1.90340021,  3.41225427],
       [ 3.28111533,  3.29415553,  0.96759117,  3.27952731,  3.28124122,
         2.7859469 ,  3.30038472,  3.3094249 ,  3.40369736],
       [ 1.86655591,  1.90822784,  3.27952731,  0.00454181,  1.70824424,
         3.23730648,  1.87208066,  1.7988169 ,  3.40953801],
       [ 1.88227246,  1.89557636,  3.28124122,  1.70824424,  0.        ,
         3.24639716,  1.86233294,  1.79612812,  3.40993463],
       [ 3.30735225,  3.37444243,  2.7859469 ,  3.23730648,  3.24639716,
         0.83880305,  3.22909087,  3.3065396 ,  3.40425919],
       [ 2.09996899,  2.09304489,  3.30038472,  1.87208066,  1.86233294,
         3.22909087,  0.06600969,  1.90932193,  3.40735266],
       [ 1.92083406,  1.90340021,  3.3094249 ,  1.7988169 ,  1

---
* **cosine_similarity**
    * The **cosine similarity** between each pair of data files is computed and the result is stored as a **symmetric matrix**.
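
The textbook cosine similarity is the dot product divided by the product of the vector norms. Note that the stored matrix has zeros on its diagonal and entries above 1, so the `Similarity` module evidently applies some further transformation that is not shown here; the sketch below covers only the standard definition.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity([3.0, 4.0], [6.0, 8.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(orthogonal)  # 0.0
```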

In [9]:
similarity_model['cosine_similarity']

array([[ 0.        ,  0.04554733,  1.81936264,  0.07844262,  0.08350343,
         1.45751444,  0.33148332,  0.08350343,  2.52787661],
       [ 0.04554733,  0.        ,  1.81683224,  0.12145954,  0.10121628,
         1.46510566,  0.36943942,  0.0708514 ,  2.52787661],
       [ 1.81936264,  1.81683224,  0.        ,  1.84466671,  1.8396059 ,
         0.54403751,  1.93829177,  1.86237956,  2.52787661],
       [ 0.07844262,  0.12145954,  1.84466671,  0.        ,  0.00506081,
         1.42967996,  0.15435483,  0.05566895,  2.53040701],
       [ 0.08350343,  0.10121628,  1.8396059 ,  0.00506081,  0.        ,
         1.42461915,  0.15688523,  0.05060814,  2.53040701],
       [ 1.45751444,  1.46510566,  0.54403751,  1.42967996,  1.42461915,
         0.        ,  1.54101787,  1.45751444,  2.51522457],
       [ 0.33148332,  0.36943942,  1.93829177,  0.15435483,  0.15688523,
         1.54101787,  0.        ,  0.20749338,  2.53040701],
       [ 0.08350343,  0.0708514 ,  1.86237956,  0.05566895,  0

---
* **euclidean_distance**
    * The **Euclidean distance** between each pair of data files is computed and the result is stored as a **symmetric matrix**.
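
The standard Euclidean distance is the square root of the summed squared differences. A minimal sketch of that definition; any normalization the `Similarity` module applies before or after is not shown in this notebook.

```python
import math

def euclidean_distance(a, b):
    """Square root of the sum of squared element-wise differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

distance = euclidean_distance([0.0, 0.0], [3.0, 4.0])
print(distance)  # 5.0
```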

In [10]:
similarity_model['euclidean_distance']

array([[ 0.        ,  0.70288244,  2.65457661,  0.68304472,  0.69220987,
         2.48622731,  1.37356627,  0.71536393,  2.73748502],
       [ 0.70288244,  0.        ,  3.04837649,  1.07654312,  0.97735451,
         2.86953551,  1.74559893,  0.79670462,  3.16824698],
       [ 2.65457661,  3.04837649,  0.        ,  2.49955295,  2.57052255,
         0.87780412,  2.17388871,  2.72747572,  1.19798133],
       [ 0.68304472,  1.07654312,  2.49955295,  0.        ,  0.16798752,
         2.31257187,  0.91434412,  0.61400463,  2.55249374],
       [ 0.69220987,  0.97735451,  2.57052255,  0.16798752,  0.        ,
         2.37829563,  0.97488233,  0.56088294,  2.63407562],
       [ 2.48622731,  2.86953551,  0.87780412,  2.31257187,  2.37829563,
         0.        ,  2.03743654,  2.5372386 ,  1.42102003],
       [ 1.37356627,  1.74559893,  2.17388871,  0.91434412,  0.97488233,
         2.03743654,  0.        ,  1.1904442 ,  2.12969099],
       [ 0.71536393,  0.79670462,  2.72747572,  0.61400463,  0

---
* **manhattan_distance**
    * The **Manhattan distance** between each pair of data files is computed and the result is stored as a **symmetric matrix**.
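
The standard Manhattan (L1) distance sums the absolute element-wise differences. A minimal sketch of that definition; any normalization the `Similarity` module applies is not shown in this notebook.

```python
def manhattan_distance(a, b):
    """Sum of absolute element-wise differences (L1 distance)."""
    return float(sum(abs(x - y) for x, y in zip(a, b)))

l1_distance = manhattan_distance([1.0, 2.0, 3.0], [4.0, 0.0, 3.0])
print(l1_distance)  # 5.0
```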

In [11]:
similarity_model['manhattan_distance']

array([[ 0.        ,  0.70627029,  2.57022038,  0.71509852,  0.76243156,
         2.21051041,  1.28946156,  0.8065271 ,  2.84866209],
       [ 0.70627029,  0.        ,  3.26769381,  1.25103092,  1.15184947,
         2.89616912,  1.88940793,  0.90807035,  3.55488388],
       [ 2.57022038,  3.26769381,  0.        ,  2.23965358,  2.37931085,
         0.73867091,  1.60578338,  2.67418819,  0.4323297 ],
       [ 0.71509852,  1.25103092,  2.23965358,  0.        ,  0.14254677,
         1.96381328,  0.88244993,  0.60381711,  2.46673463],
       [ 0.76243156,  1.15184947,  2.37931085,  0.14254677,  0.        ,
         2.09683012,  1.00890621,  0.55459577,  2.6092757 ],
       [ 2.21051041,  2.89616912,  0.73867091,  1.96381328,  2.09683012,
         0.        ,  1.53690603,  2.42181474,  0.90401707],
       [ 1.28946156,  1.88940793,  1.60578338,  0.88244993,  1.00890621,
         1.53690603,  0.        ,  1.22975758,  1.6799263 ],
       [ 0.8065271 ,  0.90807035,  2.67418819,  0.60381711,  0

---
* **gradient_similarity**
    * The **gradient similarity** between each pair of data files is computed and the result is stored as a **symmetric matrix**.
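
This notebook does not define gradient similarity. One plausible reading, offered purely as an assumption, is to compare the first differences (slopes) of two series instead of their raw values. The hypothetical `gradient_distance` below does that with a Euclidean distance, so series with the same shape but a different offset score 0.

```python
import math

def first_differences(series):
    """Differences between consecutive samples."""
    return [b - a for a, b in zip(series, series[1:])]

def gradient_distance(a, b):
    """Euclidean distance between the first differences of two series.

    Assumed definition for illustration; not taken from the Similarity module.
    """
    grad_a = first_differences(a)
    grad_b = first_differences(b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(grad_a, grad_b)))

same_shape = gradient_distance([1.0, 2.0, 4.0], [0.0, 1.0, 3.0])
opposite_shape = gradient_distance([1.0, 2.0, 4.0], [4.0, 2.0, 1.0])
print(same_shape)  # 0.0
```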

In [12]:
similarity_model['gradient_similarity']

array([[ 0.        ,  0.10607928,  1.90557234,  0.2342841 ,  0.26161339,
         2.57107333,  0.27672352,  0.30158585,  0.41421799],
       [ 0.10607928,  0.        ,  1.96181132,  0.27306162,  0.26477418,
         2.63698741,  0.2897907 ,  0.29414642,  0.43345257],
       [ 1.90557234,  1.96181132,  0.        ,  1.94585317,  1.9702915 ,
         2.00374964,  1.87889834,  2.02868906,  1.96439392],
       [ 0.2342841 ,  0.27306162,  1.94585317,  0.        ,  0.03981827,
         2.5235458 ,  0.2203689 ,  0.28493387,  0.33330941],
       [ 0.26161339,  0.26477418,  1.9702915 ,  0.03981827,  0.        ,
         2.54039051,  0.22468608,  0.27992286,  0.339631  ],
       [ 2.57107333,  2.63698741,  2.00374964,  2.5235458 ,  2.54039051,
         0.        ,  2.51267576,  2.64862838,  2.60302817],
       [ 0.27672352,  0.2897907 ,  1.87889834,  0.2203689 ,  0.22468608,
         2.51267576,  0.        ,  0.20055613,  0.23925656],
       [ 0.30158585,  0.29414642,  2.02868906,  0.28493387,  0

---
* **reversed_gradient_similarity**
    * The **reversed gradient similarity** between each pair of data files is computed and the result is stored as a **symmetric matrix**.
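
The notebook does not define the reversed variant either. One guess, labeled here as an assumption, is that the slopes of one series are compared with the negated slopes of the other, so that mirror-image series score as close. Under that reading a series compared with itself does not score 0, which would at least be consistent with the nonzero diagonal in the output below. The helper `reversed_gradient_distance` is hypothetical.

```python
import math

def first_differences(series):
    """Differences between consecutive samples."""
    return [b - a for a, b in zip(series, series[1:])]

def reversed_gradient_distance(a, b):
    """Euclidean distance between the slopes of a and the negated slopes of b.

    Assumed definition for illustration; not taken from the Similarity module.
    """
    grad_a = first_differences(a)
    negated_grad_b = [-g for g in first_differences(b)]
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(grad_a, negated_grad_b)))

mirror_pair = reversed_gradient_distance([1.0, 2.0, 4.0], [4.0, 3.0, 1.0])
self_pair = reversed_gradient_distance([1.0, 2.0, 4.0], [1.0, 2.0, 4.0])
print(mirror_pair)  # 0.0
```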

In [13]:
similarity_model['reversed_gradient_similarity']

array([[ 0.53833599,  0.55584057,  1.80892376,  0.46502679,  0.47076829,
         2.37400675,  0.37792398,  0.55198957,  0.3757184 ],
       [ 0.55584057,  0.57338017,  1.83258995,  0.48253137,  0.48827288,
         2.38888565,  0.39546357,  0.56938912,  0.39297792],
       [ 1.80892376,  1.83258995,  3.35408843,  1.78231679,  1.79057895,
         3.5933761 ,  1.6972095 ,  1.83395531,  1.77275928],
       [ 0.46502679,  0.48253137,  1.78231679,  0.39168258,  0.39742408,
         2.31935744,  0.30594513,  0.47987068,  0.30268927],
       [ 0.47076829,  0.48827288,  1.79057895,  0.39742408,  0.40316559,
         2.33133057,  0.31165162,  0.4856822 ,  0.30839577],
       [ 2.37400675,  2.38888565,  3.5933761 ,  2.31935744,  2.33133057,
         4.51712803,  2.27321535,  2.40337944,  2.35951295],
       [ 0.37792398,  0.39546357,  1.6972095 ,  0.30594513,  0.31165162,
         2.27321535,  0.22041773,  0.39427326,  0.21614661],
       [ 0.55198957,  0.56938912,  1.83395531,  0.47987068,  0

In [14]:
_, _ = Similarity.Build.similarity_model(GlobalParameter.TimeInterval,GlobalParameter.SemiPreprocessedPath)

calculating covariance
	run_time: 0.256426095963 sec
calculating cosine_similarity
	run_time: 0.242129087448 sec
calculating euclidean_distance
	run_time: 0.308069944382 sec
calculating manhattan_distance
	run_time: 0.301975011826 sec
calculating gradient_similarity
	run_time: 0.283159017563 sec
calculating reversed_gradient_similarity
	run_time: 0.157516002655 sec


In [16]:
_, _ = Similarity.Build.similarity_model(30,GlobalParameter.FullyPreprocessedPath)
print 
_, _ = Similarity.Build.similarity_model(30,GlobalParameter.SemiPreprocessedPath)

calculating covariance
	run_time: 6.18096089363 sec
calculating cosine_similarity
	run_time: 6.92706179619 sec
calculating euclidean_distance
	run_time: 6.7218439579 sec
calculating manhattan_distance
	run_time: 7.06306004524 sec
calculating gradient_similarity
	run_time: 13.7114930153 sec
calculating reversed_gradient_similarity
	run_time: 6.70810580254 sec

calculating covariance
	run_time: 5.87729310989 sec
calculating cosine_similarity
	run_time: 7.20353913307 sec
calculating euclidean_distance
	run_time: 7.11362099648 sec
calculating manhattan_distance
	run_time: 7.62489914894 sec
calculating gradient_similarity
	run_time: 9.53638792038 sec
calculating reversed_gradient_similarity
	run_time: 8.29822802544 sec


In [17]:
_, _ = Similarity.Build.similarity_model(60,GlobalParameter.FullyPreprocessedPath)
print 
_, _ = Similarity.Build.similarity_model(60,GlobalParameter.SemiPreprocessedPath)

calculating covariance
	run_time: 3.27749896049 sec
calculating cosine_similarity
	run_time: 3.59292411804 sec
calculating euclidean_distance
	run_time: 3.49220299721 sec
calculating manhattan_distance
	run_time: 3.26843905449 sec
calculating gradient_similarity
	run_time: 3.40499091148 sec
calculating reversed_gradient_similarity
	run_time: 3.39924407005 sec

calculating covariance
	run_time: 3.13887095451 sec
calculating cosine_similarity
	run_time: 3.59147691727 sec
calculating euclidean_distance
	run_time: 3.50006508827 sec
calculating manhattan_distance
	run_time: 3.89899611473 sec
calculating gradient_similarity
	run_time: 3.86369895935 sec
calculating reversed_gradient_similarity
	run_time: 6.75289797783 sec
