# Multidimensional scaling for bigBed Jaccard similarity matrix

In this notebook we will try to derive some notion of Euclidean distance from the Jaccard similarities we have computed between all pairs of bigBed files. We would also like to cluster the files, possibly with biclustering, to see whether the clusters recapitulate known features of the data.

In [17]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

Here, `metadata` is a file report TSV from the [ENCODE portal](https://www.encodeproject.org), and `data` is a TSV file containing pairs of file IDs and their computed Jaccard values.

In [9]:
metadata = pd.read_csv("../data/file_report_2020_4_15_2h_50m.tsv", sep="\t", header=1)
data = pd.read_csv("../data/jaccard.tsv", sep="\t")
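The `header=1` argument tells pandas to take the column names from the second line of the file and to skip the line before it. A minimal sketch with made-up inline data, assuming the report starts with a single preamble line ahead of the real header (the preamble contents and URL below are hypothetical):

```python
import io

import pandas as pd

# Hypothetical report: one preamble line, then the real header, then two rows.
raw = (
    "2020-04-15 02:50:00\thttps://example.org/report\n"
    "ID\tBiosample name\n"
    "/files/ENCFF145CKY/\tGM12878\n"
    "/files/ENCFF405KVS/\tK562\n"
)

# header=1: line 0 is skipped, line 1 supplies the column names.
report = pd.read_csv(io.StringIO(raw), sep="\t", header=1)
print(report.columns.tolist())
print(len(report))
```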

We'll need to add a column corresponding to the file accessions. You can avoid this if you include the accession in the report you download from ENCODE.

In [13]:
metadata["Accession"] = metadata.ID.str.split("/").map(lambda elem: elem[-2])
metadata.head()

| | ID | Dataset | Biological replicates | Biosample name | Target label | Cloud metadata | Biosample ontology | Accession |
|---|---|---|---|---|---|---|---|---|
| 0 | /files/ENCFF145CKY/ | /experiments/ENCSR331HPA/ | 12 | GM12878 | GABPA | {'md5sum_base64': 'KU4lHzdxxDMQXy0jPcCFdQ==', ... | /biosample-types/cell_line_EFO_0002784/ | ENCFF145CKY |
| 1 | /files/ENCFF405KVS/ | /experiments/ENCSR000EGT/ | 12 | K562 | IRF1 | {'md5sum_base64': '4UetgEthvZJtWG0SY74a7A==', ... | /biosample-types/cell_line_EFO_0002067/ | ENCFF405KVS |
| 2 | /files/ENCFF483DQZ/ | /experiments/ENCSR874HSH/ | 12 | OCI-LY1 | H2AFZ | {'md5sum_base64': 'F1Au9eRLKW1AoUtGTbaObw==', ... | /biosample-types/cell_line_EFO_0005907/ | ENCFF483DQZ |
| 3 | /files/ENCFF749YTS/ | /experiments/ENCSR739IHN/ | 12 | GM12878 | TBX21 | {'md5sum_base64': 'U1nk0FgMu7PQkXCEMvuxPA==', ... | /biosample-types/cell_line_EFO_0002784/ | ENCFF749YTS |
| 4 | /files/ENCFF272FFT/ | /experiments/ENCSR086YIH/ | 12 | MM.1S | H3K4me2 | {'md5sum_base64': 'mwUvruHFGkJpp19Mn3QhMg==', ... | /biosample-types/cell_line_EFO_0005724/ | ENCFF272FFT |
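To see why index `-2` picks out the accession: splitting `/files/ENCFF145CKY/` on `/` yields an empty string at each end, so the accession is the second-to-last element. A small sketch on two of the IDs shown above, with an equivalent form that uses the `.str` accessor instead of a lambda:

```python
import pandas as pd

ids = pd.Series(["/files/ENCFF145CKY/", "/files/ENCFF405KVS/"])

# "/files/ENCFF145CKY/".split("/") gives ["", "files", "ENCFF145CKY", ""],
# so the accession sits at position -2.
by_map = ids.str.split("/").map(lambda parts: parts[-2])

# Same result using the vectorized .str indexer.
by_accessor = ids.str.split("/").str[-2]

print(by_map.tolist())
```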


In [10]:
data.head()

| | id1 | id2 | jaccard |
|---|---|---|---|
| 0 | ENCFF145CKY | ENCFF405KVS | 0.02 |
| 1 | ENCFF145CKY | ENCFF483DQZ | 0.0 |
| 2 | ENCFF145CKY | ENCFF749YTS | 0.0 |
| 3 | ENCFF145CKY | ENCFF272FFT | 0.01 |
| 4 | ENCFF145CKY | ENCFF198STH | 0.0 |


Let's encode the file accessions as integer labels. That way we can transform the file IDs into numbers that index directly into the similarity matrix we want to construct. We need to concatenate the two ID columns before fitting, because each column on its own is missing one of the file IDs.

In [36]:
label_encoder = LabelEncoder()
label_encoder.fit(np.concatenate((data.id1.values, data.id2.values)))

LabelEncoder()
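A toy illustration of why both columns are needed, assuming each pair is listed only once (the file names `A`, `B`, `C` are made up). `A` never appears in `id2` and `C` never appears in `id1`, so an encoder fitted on a single column would not know one of the files:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical pair list for three files; each unordered pair appears once.
id1 = np.array(["A", "A", "B"])
id2 = np.array(["B", "C", "C"])

# Fitting on both columns covers every file.
encoder = LabelEncoder().fit(np.concatenate((id1, id2)))
print(encoder.classes_)        # sorted unique labels
print(encoder.transform(id1))  # integer index for each entry of id1

# An encoder fitted on id1 alone has never seen "C" and rejects it.
partial = LabelEncoder().fit(id1)
try:
    partial.transform(id2)
    unseen_label_rejected = False
except ValueError:
    unseen_label_rejected = True
```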

In [37]:
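# Square matrix with one row and one column per file (assumes one metadata row per file).
similarity = np.zeros((metadata.shape[0], metadata.shape[0]))
x_idxs = label_encoder.transform(data.id1.values)
y_idxs = label_encoder.transform(data.id2.values)
# Only the (id1, id2) entries listed in `data` are filled; all other entries stay 0.
similarity[x_idxs, y_idxs] = data.jaccard.values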

In [40]:
similarity.shape

(2677, 2677)
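If each pair appears only once in `data`, the matrix built this way is filled in one direction only and has zeros on the diagonal, whereas distance-based methods such as MDS generally expect a symmetric dissimilarity matrix. One possible way to get there, sketched on a made-up 3x3 matrix, is to mirror the filled entries, set the self-similarity to 1, and use the Jaccard distance `1 - J`. This is an assumption about how one might proceed, not necessarily what the rest of the notebook does:

```python
import numpy as np

# Toy one-directional similarity matrix for three files (made-up values).
upper = np.array([
    [0.0, 0.02, 0.0],
    [0.0, 0.0, 0.01],
    [0.0, 0.0, 0.0],
])

# Mirror across the diagonal. Taking the elementwise maximum works because
# Jaccard values are non-negative and the unfilled entries are 0.
symmetric = np.maximum(upper, upper.T)

# A non-empty file overlaps itself completely, so J(i, i) = 1.
np.fill_diagonal(symmetric, 1.0)

# Jaccard distance: 0 on the diagonal, 1 for files with no overlap.
distance = 1.0 - symmetric
print(distance)
```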

In [38]:
label_encoder.transform(["ENCFF675QHY"])

array([1818])
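The mapping also works in the other direction: `inverse_transform` turns a matrix index back into an accession, and `classes_` lists the accessions in matrix order, which can be used to align metadata rows with matrix rows. A small sketch using three accessions from the tables above (the tiny `toy_metadata` frame is a stand-in for the real report):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder().fit(
    np.array(["ENCFF483DQZ", "ENCFF145CKY", "ENCFF405KVS"])
)

# Accession -> matrix index, and back again.
idx = encoder.transform(["ENCFF405KVS"])
accession = encoder.inverse_transform(idx)

# Stand-in metadata, deliberately in a different order than the matrix.
toy_metadata = pd.DataFrame({
    "Accession": ["ENCFF483DQZ", "ENCFF145CKY", "ENCFF405KVS"],
    "Biosample name": ["OCI-LY1", "GM12878", "K562"],
})

# Reorder the metadata so that row i describes matrix row/column i.
aligned = toy_metadata.set_index("Accession").loc[encoder.classes_]
print(aligned)
```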