## Clustering  

This notebook performs exploratory clustering of the T2D and lipids data. The goal is to determine whether subsets of points are similar to one another, and whether the resulting groups fall between, across, or within the trait labels. Results depend heavily on the chosen distance measure and on the number of clusters requested. Clustering is also not guaranteed to produce useful results: that depends on the structure of the data and on what counts as a 'useful' cluster.
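
To illustrate how much the distance measure matters, here is a minimal sketch on made-up binary annotation vectors. The vectors `q`, `x`, `y` and the helper functions are hypothetical and are not part of the analysis below. Jaccard distance is used only as a contrasting example; the notebook itself uses L1.

```python
import numpy as np

def l1_distance(u, v):
    """Manhattan distance: number of positions where two binary vectors differ."""
    return int(np.abs(u - v).sum())

def jaccard_distance(u, v):
    """1 - |intersection| / |union| of the positions set to 1."""
    u, v = u.astype(bool), v.astype(bool)
    union = np.logical_or(u, v).sum()
    if union == 0:
        return 0.0  # two all-zero vectors are treated as identical
    return 1.0 - np.logical_and(u, v).sum() / union

# Hypothetical annotation profiles (1 = overlaps a feature, 0 = does not)
q = np.array([1, 1, 1, 1, 0, 0, 0, 0])
x = np.array([1, 1, 1, 1, 1, 1, 1, 0])
y = np.array([1, 1, 0, 0, 0, 0, 0, 0])

for name, fn in [("L1", l1_distance), ("Jaccard", jaccard_distance)]:
    print(name, "q-x:", round(fn(q, x), 3), "q-y:", round(fn(q, y), 3))
```

Under L1, `q` is closer to `y`; under Jaccard, `q` is closer to `x`. The same point can therefore end up in a different cluster depending only on the metric.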

In [2]:
import pandas as pd
import numpy as np
import boto3
import s3fs
import os
import sys
import warnings
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import pairwise_distances
from scipy import stats
from matplotlib import pyplot as plt
plt.style.use('ggplot')
warnings.filterwarnings(action='ignore')

import jupyterthemes
from jupyterthemes import jtplot
jtplot.style(theme='oceans16')
sys.path.insert(1, os.path.join(sys.path[0], '..'))
from Evaluator import Evaluator
from auth import access_key, secret_key

In [3]:
filepath = "s3://voightlab-data/"
lipids_df = pd.read_table(filepath + "lipids/lipids_groups.txt")
t2d_df = pd.read_table(filepath + "t2d/t2d_groups.txt")
both_df = pd.read_table(filepath + "grouped/lipids_T2D_overlapping_groups.txt")

In [4]:
lipids_df.head()

Unnamed: 0,Chr1_Group_1001
0,Chr1_Group_1008
1,Chr1_Group_1009
2,Chr1_Group_1037
3,Chr1_Group_1053
4,Chr1_Group_107617707


In [7]:
grouped_df = pd.read_table(filepath + "grouped/ML_table_grouped_snpcount.txt")
grouped_df.head()

Unnamed: 0,snp,MCF-7_ChIP-seq_CTCF_ENCSR000AHD_ENCFF001UML_ENCFF001UMN_intersect.bed,MCF-7_ChIP-seq_TAF1_ENCSR000AHF_ENCFF001UNU_ENCFF001UNT_intersect.bed,GM12878_ChIP-seq_CTCF_ENCFF002CDP.bed,K562_ChIP-seq_CTCF_ENCFF002CEL.bed,K562_ChIP-seq_POLR2A_ENCFF002CET.bed,endothelial_cell_of_umbilical_vein_ChIP-seq_CTCF_ENCFF002CEH.bed,endothelial_cell_of_umbilical_vein_ChIP-seq_POLR2A_ENCFF002CEJ.bed,keratinocyte_ChIP-seq_CTCF_ENCFF002CFA.bed,keratinocyte_ChIP-seq_POLR2A_ENCFF002CFC.bed,...,Hepatocyte_PPARA_GW7647_2hr.bed,Hepatocyte_PPARA_GW7647_24hr.bed,liver_USF1_ctrl_peaks.narrowPeak,liver_USF1_ASH_peaks.narrowPeak,islet_pooled_H3K4me1_final.bed,islet_CTCF_intersectall.bed,islet_H3K27ac.bed,islet_pooled_H3K27ac.bed,islet_pooled_H3K4me3_peaks.broadPeak,snpcount
0,Chr1_Group_1,0,0,0,1,1,1,0,1,0,...,0,0,0,0,0,0,1,1,1,5
1,Chr1_Group_10,1,0,1,1,0,1,0,1,0,...,0,0,0,0,0,1,0,1,0,52
2,Chr1_Group_100,0,0,0,0,1,0,0,0,0,...,0,0,0,0,1,0,1,1,1,57
3,Chr1_Group_1000,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,25
4,Chr1_Group_100046246,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,5


In [6]:
# Min-max scale the continuous snpcount column so it falls between 0 and 1
if 'snpcount' in grouped_df.columns:
    snp_min = grouped_df['snpcount'].min()
    snp_max = grouped_df['snpcount'].max()
    grouped_df['snpcount'] = (grouped_df['snpcount'] - snp_min) / (snp_max - snp_min)
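
As a worked example of the scaling step, the sketch below applies the same min-max formula to a small made-up column. The `toy` frame and the `min_max_scale` helper are hypothetical; the guard for a constant column is an added assumption, since the cell above would divide by zero in that case.

```python
import pandas as pd

def min_max_scale(series):
    """Rescale a numeric Series to [0, 1]; a constant Series maps to all zeros."""
    lo, hi = series.min(), series.max()
    if hi == lo:
        # Assumption: avoid 0/0 by returning zeros for a constant column
        return pd.Series(0.0, index=series.index, name=series.name)
    return (series - lo) / (hi - lo)

# Hypothetical counts, chosen to resemble the snpcount values shown above
toy = pd.DataFrame({"snpcount": [5, 52, 57, 25, 5]})
toy["snpcount_scaled"] = min_max_scale(toy["snpcount"])
print(toy)
```

Scaling matters here because every other column is binary: an unscaled count in the tens would dominate an L1 distance computed across all features.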

### Agglomerative Clustering

Agglomerative clustering is a hierarchical method that takes a bottom-up approach. Each observation starts in its own cluster; at each step, the two clusters that are closest under the chosen linkage criterion are merged, until the requested number of clusters remains.
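
The merge loop can be sketched from scratch as follows. This is a naive illustration with L1 distance and average linkage, not scikit-learn's implementation; the function name `agglomerative_average_l1` and the `toy` data are hypothetical.

```python
import numpy as np

def agglomerative_average_l1(X, n_clusters):
    """Naive bottom-up clustering with L1 distance and average linkage."""
    X = np.asarray(X, dtype=float)
    # Pairwise L1 (Manhattan) distances between all observations
    dist = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)
    # Every observation starts in its own cluster
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > n_clusters:
        best = (np.inf, None, None)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average linkage: mean distance over all cross-cluster pairs
                d = dist[np.ix_(clusters[a], clusters[b])].mean()
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Merge the closest pair (b > a, so deleting b leaves index a valid)
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    labels = np.empty(len(X), dtype=int)
    for label, members in enumerate(clusters):
        labels[members] = label
    return labels

# Hypothetical binary profiles: rows 0-1 are similar, rows 2-3 are similar
toy = np.array([[0, 0, 0, 0],
                [0, 0, 0, 1],
                [1, 1, 1, 1],
                [1, 1, 1, 0]])
labels = agglomerative_average_l1(toy, n_clusters=2)
print(labels)
```

The sketch rescans every pair of clusters on each merge, so it is only suitable for tiny inputs; it is meant to show the logic, not to replace the library call.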

In [None]:
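# Three clusters, L1 (Manhattan) distance, average linkage.
# Note: scikit-learn 1.2 renamed `affinity` to `metric`; `affinity` was removed in 1.4.
agg_cluster = AgglomerativeClustering(n_clusters=3,
                                      affinity='l1',
                                      linkage='average')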
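
The cell above only constructs the estimator. A minimal sketch of fitting it might look like the following, run on a small made-up binary matrix rather than the real feature table. `toy_X` is hypothetical, and the keyword lookup is an assumption added so the snippet runs on scikit-learn versions that use either `affinity` or `metric`.

```python
import inspect

import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Pick whichever distance keyword this scikit-learn version accepts
params = inspect.signature(AgglomerativeClustering).parameters
dist_kw = "metric" if "metric" in params else "affinity"

# Hypothetical data: three blocks of rows, each with 1s in its own columns
toy_X = np.zeros((9, 12))
toy_X[0:3, 0:4] = 1
toy_X[3:6, 4:8] = 1
toy_X[6:9, 8:12] = 1
# Flip a few entries so rows within a block are not all identical
toy_X[1, 0] = 0
toy_X[4, 5] = 0
toy_X[7, 9] = 0

model = AgglomerativeClustering(n_clusters=3, linkage="average",
                                **{dist_kw: "l1"})
toy_labels = model.fit_predict(toy_X)  # one integer cluster label per row
print(toy_labels)
```

On the real data, the same `fit_predict` call would presumably be applied to the numeric feature columns of `grouped_df`, with the identifier columns dropped first.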