# Most Similar Package

<b> Goal: Create functions that report which packages are most similar to a given package. Based on: </b>

    * Package and distribution identification analysis for both 30-minute and 24-hour clusters (with different numbers of clusters)

Note: All output is for model2 only.

In [1]:
import pandas as pd
%pylab inline

Populating the interactive namespace from numpy and matplotlib


In [2]:
print('Structure of files.')
!pwd
print('\nDifferent folders showing the different combinations of package and distribution identification analysis.')
!ls
print('\nEach folder has 4 similarity matrices where each .csv file is for a different eigenvector.')
!ls package_identification_24hourclusters_60clusters/

Structure of files.
C:\Users\ostus\Desktop\Capstone\AnomalyDetectionMachineData\eda\TimeSegmentClusters

Different folders showing the different combinations of package and distribution identification analysis.
Most Similar Package.ipynb
package_identification_24hourclusters_150clusters
package_identification_24hourclusters_225clusters
package_identification_24hourclusters_60clusters
package_identification_30minclusters_150clusters
package_identification_30minclusters_225clusters
package_identification_30minclusters_60clusters

Each folder has 4 similarity matrices where each .csv file is for a different eigenvector.
Package_Distribution_for_Eigenvector_1_Clusters_in_24_Hour_Load_Profiles.csv
Package_Distribution_for_Eigenvector_2_Clusters_in_24_Hour_Load_Profiles.csv
Package_Distribution_for_Eigenvector_3_Clusters_in_24_Hour_Load_Profiles.csv
Package_Distribution_for_Eigenvector_4_Clusters_in_24_Hour_Load_Profiles.csv


In [3]:
from sklearn.metrics import pairwise
import re

def similarityByTimeSegmentClusters(PSN, similarityMetric = 'euclidean', timeLoadProfile = '30min', nClusters = 150):
    """
    Find the PSNs most similar to a given PSN.

    Inputs:
    PSN: the PSN of interest.
    similarityMetric: distance metric, one of ['euclidean', 'cosine', 'manhattan'].
    timeLoadProfile: one of ['30min', '24hour'].
    nClusters: one of [60, 150, 225].

    Output: a dictionary holding the input parameters and, under 'results', one
    dictionary per eigenvector mapping every other PSN to its distance from the
    PSN of interest, sorted from most to least similar.

    For each eigenvector file, the function transposes the cluster x PSN matrix,
    computes the pairwise distance matrix between PSNs, and sorts the distances
    to the PSN of interest. For distance-based metrics, the smaller the distance,
    the more similar the items.
    """
    similarityDict = {}
    path = "./package_identification_{}clusters_{}clusters/".format(timeLoadProfile, str(nClusters))
    files = !ls $path
    similarities = []
    for file in files:
        eig = re.search('Eigenvector_(.*)_Clusters', file).group(1)
        data = pd.read_csv(path+file)
        data =  data.rename(columns = {'Unnamed: 0': 'PSN'}).set_index('PSN').T       
        similarityMatrix = pd.DataFrame(pairwise.pairwise_distances(data, metric = similarityMetric), columns=data.index, index=data.index)
        similar = similarityMatrix[str(PSN)].drop([str(PSN)], axis=0).sort_values(ascending=True).to_dict()
        similarities.append({'eig'+eig: similar})
    similarityDict['PSN'] = str(PSN)
    similarityDict['similarityMetric'] = similarityMetric
    similarityDict['timeLoadProfile'] = timeLoadProfile
    similarityDict['nClusters'] = str(nClusters)
    similarityDict['results'] = similarities
    return similarityDict
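
The docstring's point that a smaller distance means a more similar package can be checked on a toy example. The sketch below is not part of the analysis: it re-implements the three supported metrics in plain Python (the function above relies on `pairwise.pairwise_distances` instead), and the `profiles` vectors and the `rank_neighbours` helper are made up for illustration.

```python
import math

def euclidean(u, v):
    # Straight-line distance between two vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan(u, v):
    # Sum of absolute coordinate differences.
    return sum(abs(a - b) for a, b in zip(u, v))

def cosine_distance(u, v):
    # 1 - cosine similarity; depends on direction only, not magnitude.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Hypothetical cluster-count vectors for three packages (made-up numbers).
profiles = {
    "A": [3, 0, 4],
    "B": [6, 0, 8],  # same direction as A, twice the magnitude
    "C": [0, 5, 0],  # shares no cluster with A
}

def rank_neighbours(target, metric):
    # Distance from the target to every other package, smallest first.
    distances = {
        name: metric(profiles[target], vector)
        for name, vector in profiles.items()
        if name != target
    }
    return dict(sorted(distances.items(), key=lambda kv: kv[1]))

for metric in (euclidean, manhattan, cosine_distance):
    print(metric.__name__, rank_neighbours("A", metric))
```

All three metrics rank B ahead of C here, but only the cosine distance treats A and B as identical (distance 0), because it ignores magnitude. Assuming the matrix entries are non-negative counts, a cosine distance of exactly 1.0 means the two packages have no cluster in common.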

### Test the function

Find the PSNs most similar to PSN 34, using the default parameters.

In [4]:
similarityByTimeSegmentClusters?

In [5]:
similarityByTimeSegmentClusters(34)

{'PSN': '34',
 'similarityMetric': 'euclidean',
 'timeLoadProfile': '30min',
 'nClusters': '150',
 'results': [{'eig1': {'66': 577.3265973433062,
    '64': 680.7708865690424,
    '67': 681.6692746486378,
    '65': 728.5753221184478,
    '68': 959.6926591362466,
    '41': 989.5842561399206,
    '63': 1082.3114154438176,
    '47': 1092.749285060393,
    '50': 1094.754310336342,
    '46': 1095.1200847395687,
    '51': 1095.3501723193365,
    '40': 1096.9316295922913,
    '61': 1105.8693412876587,
    '69': 1106.1283831454648,
    '71': 1108.3857631709277,
    '39': 1118.17664078624,
    '38': 1143.002187224504,
    '60': 1143.2956747928333,
    '62': 1179.4697961372306,
    '53': 1207.2791723540997,
    '55': 1236.596538892132,
    '58': 1252.3801339848856,
    '45': 1254.4014508920181,
    '37': 1278.6989481500327,
    '56': 1284.4625335135315,
    '59': 1300.5875595283849,
    '57': 1314.4264909077267,
    '49': 1320.7001930794136,
    '48': 1339.6148700279496,
    '72': 1418.5718170046

Find the PSNs most similar to PSN 50, with additional parameters specified.

In [6]:
similarityByTimeSegmentClusters(PSN = 50, similarityMetric = 'cosine', timeLoadProfile = '24hour', nClusters = 225)

{'PSN': '50',
 'similarityMetric': 'cosine',
 'timeLoadProfile': '24hour',
 'nClusters': '225',
 'results': [{'eig1': {'62': 0.7297428230728351,
    '51': 0.8556624327025936,
    '49': 0.8743233765575282,
    '60': 0.965869401589419,
    '47': 0.9751240702447502,
    '48': 0.9835470337589605,
    '66': 0.9942259198712176,
    '59': 1.0,
    '61': 1.0,
    '34': 1.0,
    '63': 1.0,
    '64': 1.0,
    '65': 1.0,
    '67': 1.0,
    '68': 1.0,
    '69': 1.0,
    '58': 1.0,
    '57': 1.0,
    '55': 1.0,
    '71': 1.0,
    '53': 1.0,
    '46': 1.0,
    '45': 1.0,
    '42': 1.0,
    '41': 1.0,
    '40': 1.0,
    '39': 1.0,
    '38': 1.0,
    '37': 1.0,
    '36': 1.0,
    '35': 1.0,
    '56': 1.0,
    '72': 1.0}},
  {'eig2': {'39': 0.36553606983528875,
    '69': 0.627551839107233,
    '51': 0.6666666666666667,
    '63': 0.7113248654051871,
    '57': 0.8735116419524208,
    '36': 0.8855407955443277,
    '53': 0.9204140433266021,
    '35': 0.9210893391770424,
    '37': 0.9444888457632793,
    '6
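
Since the full output is long, it may help to keep only the closest few PSNs per eigenvector. The helper below is a hypothetical sketch (`top_n_per_eigenvector` is not part of the notebook); it assumes only the dictionary layout shown in the outputs above. The `example` input copies the first four entries of `eig1` and `eig2` from the cosine run for PSN 50, rounded to four decimals.

```python
def top_n_per_eigenvector(similarity_dict, n=3):
    """Return {eigenvector label: [(PSN, distance), ...]} with the n closest PSNs."""
    top = {}
    for entry in similarity_dict["results"]:
        for eig_label, neighbours in entry.items():
            # Re-sort defensively so the helper does not rely on dict ordering.
            ranked = sorted(neighbours.items(), key=lambda kv: kv[1])
            top[eig_label] = ranked[:n]
    return top

# Truncated, rounded copy of the cosine output above (illustration only).
example = {
    "PSN": "50",
    "results": [
        {"eig1": {"62": 0.7297, "51": 0.8557, "49": 0.8743, "60": 0.9659}},
        {"eig2": {"39": 0.3655, "69": 0.6276, "51": 0.6667, "63": 0.7113}},
    ],
}

print(top_n_per_eigenvector(example, n=2))
```

In the notebook, the same helper could be applied directly to the return value of `similarityByTimeSegmentClusters`.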

### Next Steps?
Error handling may still be needed; let me know.
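
One possible starting point for that error handling is to check the arguments before any file is read. This is only a sketch: `validate_inputs` and `check_psn` are hypothetical helpers, and the supported values are taken from the docstring of `similarityByTimeSegmentClusters`.

```python
SUPPORTED_METRICS = ("euclidean", "cosine", "manhattan")
SUPPORTED_PROFILES = ("30min", "24hour")
SUPPORTED_N_CLUSTERS = (60, 150, 225)

def validate_inputs(similarityMetric, timeLoadProfile, nClusters):
    # Raise a descriptive error instead of failing later on a missing folder.
    if similarityMetric not in SUPPORTED_METRICS:
        raise ValueError(
            "similarityMetric must be one of {}, got {!r}".format(
                SUPPORTED_METRICS, similarityMetric))
    if timeLoadProfile not in SUPPORTED_PROFILES:
        raise ValueError(
            "timeLoadProfile must be one of {}, got {!r}".format(
                SUPPORTED_PROFILES, timeLoadProfile))
    if int(nClusters) not in SUPPORTED_N_CLUSTERS:
        raise ValueError(
            "nClusters must be one of {}, got {!r}".format(
                SUPPORTED_N_CLUSTERS, nClusters))

def check_psn(PSN, known_psns):
    # known_psns would be the PSN labels of the similarity matrix.
    if str(PSN) not in known_psns:
        raise KeyError("PSN {} not found in the data".format(PSN))

validate_inputs("cosine", "24hour", 225)  # passes silently
```

Calling `validate_inputs` at the top of the function and `check_psn` right after the similarity matrix is built would turn an obscure `KeyError` or a file-not-found failure into a clear message.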