## Data Set Information:

Taken from [source](https://archive.ics.uci.edu/ml/datasets/Mice+Protein+Expression).

The data set consists of the expression levels of 77 proteins/protein modifications that produced detectable signals in the nuclear fraction of cortex. There are 38 control mice and 34 trisomic mice (Down syndrome), for a total of 72 mice. In the experiments, 15 measurements were registered of each protein per sample/mouse. Therefore, for control mice, there are 38x15, or 570 measurements, and for trisomic mice, there are 34x15, or 510 measurements. The dataset contains a total of 1080 measurements per protein. Each measurement can be considered as an independent sample/mouse. 

The eight classes of mice are described based on features such as genotype, behavior and treatment. According to genotype, mice can be control or trisomic. According to behavior, some mice have been stimulated to learn (context-shock) and others have not (shock-context) and in order to assess the effect of the drug memantine in recovering the ability to learn in trisomic mice, some mice have been injected with the drug and others have not. 

Classes: 
c-CS-s: control mice, stimulated to learn, injected with saline (9 mice) 
c-CS-m: control mice, stimulated to learn, injected with memantine (10 mice) 
c-SC-s: control mice, not stimulated to learn, injected with saline (9 mice) 
c-SC-m: control mice, not stimulated to learn, injected with memantine (10 mice) 

t-CS-s: trisomy mice, stimulated to learn, injected with saline (7 mice) 
t-CS-m: trisomy mice, stimulated to learn, injected with memantine (9 mice) 
t-SC-s: trisomy mice, not stimulated to learn, injected with saline (9 mice) 
t-SC-m: trisomy mice, not stimulated to learn, injected with memantine (9 mice) 

The aim is to identify subsets of proteins that are discriminant between the classes. 

In [158]:
from collections import defaultdict

import pandas as pd

from sklearn import decomposition
from sklearn.cluster import KMeans, Birch, SpectralClustering
from sklearn.ensemble import RandomForestClassifier

import plotly.plotly as py
import plotly.figure_factory as ff
import plotly.plotly as py
import plotly.graph_objs as go
from plotly import tools

import numpy as np

In [88]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_moons, make_circles, make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis


In [2]:
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.cm as cm
matplotlib.style.use('ggplot')
%matplotlib inline 

In [128]:
data_path = "../data/external/data_geneMice/Data_Cortex_Nuclear.xls"
raw_df = pd.read_excel(data_path)
raw_df = raw_df.dropna()
raw_df.head()

Unnamed: 0,MouseID,DYRK1A_N,ITSN1_N,BDNF_N,NR1_N,NR2A_N,pAKT_N,pBRAF_N,pCAMKII_N,pCREB_N,...,pCFOS_N,SYP_N,H3AcK18_N,EGR1_N,H3MeK4_N,CaNA_N,Genotype,Treatment,Behavior,class
75,3415_1,0.649781,0.828696,0.405862,2.921435,5.167979,0.207174,0.17664,3.728084,0.239283,...,0.129363,0.486912,0.125152,0.146865,0.143517,1.627181,Control,Memantine,C/S,c-CS-m
76,3415_2,0.616481,0.841974,0.388584,2.862575,5.194163,0.223433,0.167725,3.64824,0.22103,...,0.143084,0.467833,0.112857,0.161132,0.145719,1.562096,Control,Memantine,C/S,c-CS-m
77,3415_3,0.637424,0.852882,0.400561,2.968155,5.35082,0.20879,0.173261,3.814545,0.2223,...,0.147673,0.462501,0.116433,0.160594,0.142879,1.571868,Control,Memantine,C/S,c-CS-m
78,3415_4,0.576815,0.75539,0.348346,2.624901,4.727509,0.205892,0.161192,3.77853,0.194153,...,0.12129,0.47911,0.102831,0.144238,0.141681,1.646608,Control,Memantine,C/S,c-CS-m
79,3415_5,0.542545,0.757917,0.350051,2.634509,4.735602,0.210526,0.165671,3.871971,0.194297,...,0.142617,0.438354,0.110614,0.155667,0.146408,1.607631,Control,Memantine,C/S,c-CS-m


In [180]:
class_labels = ['Genotype','Treatment','Behavior','class', 'MouseID']
melt_df = pd.melt(raw_df, id_vars = class_labels)
melt_df

Unnamed: 0,Genotype,Treatment,Behavior,class,MouseID,variable,value
0,Control,Memantine,C/S,c-CS-m,3415_1,DYRK1A_N,0.649781
1,Control,Memantine,C/S,c-CS-m,3415_2,DYRK1A_N,0.616481
2,Control,Memantine,C/S,c-CS-m,3415_3,DYRK1A_N,0.637424
3,Control,Memantine,C/S,c-CS-m,3415_4,DYRK1A_N,0.576815
4,Control,Memantine,C/S,c-CS-m,3415_5,DYRK1A_N,0.542545
5,Control,Memantine,C/S,c-CS-m,3415_6,DYRK1A_N,0.569918
6,Control,Memantine,C/S,c-CS-m,3415_7,DYRK1A_N,0.494053
7,Control,Memantine,C/S,c-CS-m,3415_8,DYRK1A_N,0.485692
8,Control,Memantine,C/S,c-CS-m,3415_9,DYRK1A_N,0.508725
9,Control,Memantine,C/S,c-CS-m,3415_10,DYRK1A_N,0.408177


I want to do a series of violin plots of the expression value for each protein type. To make the plots have proteins with similar magnitudes so that their sized appropriately, firstly get the mean values and sort the DF by that.

In [181]:
expression_level_means = melt_df.groupby(by=['variable']).median().reset_index()
expression_level_means.head()

Unnamed: 0,variable,value
0,ADARB1_N,1.136801
1,AKT_N,0.679605
2,AMPKA_N,0.351836
3,APP_N,0.406108
4,ARC_N,0.1194


In [182]:
variable_list = expression_level_means.sort_values(by='value').variable
len(variable_list)

77

In [183]:
variable_subsets = np.array(variable_list).reshape(7,11)
variable_subsets

array([['GFAP_N', 'pS6_N', 'ARC_N', 'GluR4_N', 'BCL2_N', 'pCFOS_N',
        'AcetylH3K9_N', 'H3AcK18_N', 'BAD_N', 'SNCA_N', 'ERBB4_N'],
       ['RRP1_N', 'RSK_N', 'pGSK3B_N', 'EGR1_N', 'CREB_N', 'pBRAF_N',
        'BAX_N', 'NUMB_N', 'nNOS_N', 'Tau_N', 'H3MeK4_N'],
       ['pCREB_N', 'GluR3_N', 'pAKT_N', 'SHH_N', 'JNK_N', 'MEK_N',
        'pMEK_N', 'P3525_N', 'CDK5_N', 'RAPTOR_N', 'BDNF_N'],
       ['PKCA_N', 'pJNK_N', 'BRAF_N', 'pNUMB_N', 'CAMKII_N', 'AMPKA_N',
        'pP70S6_N', 'DYRK1A_N', 'SOD1_N', 'P38_N', 'TIAM1_N'],
       ['APP_N', 'MTOR_N', 'S6_N', 'pRSK_N', 'SYP_N', 'pERK_N', 'IL1B_N',
        'NR2B_N', 'DSCR1_N', 'ITSN1_N', 'AKT_N'],
       ['TRKA_N', 'pNR2A_N', 'pMTOR_N', 'pNR1_N', 'pGSK3B_Tyr216_N',
        'P70S6_N', 'ELK_N', 'ADARB1_N', 'GSK3B_N', 'Ubiquitin_N', 'pELK_N'],
       ['CaNA_N', 'pPKCAB_N', 'pNR2B_N', 'pCASP9_N', 'pPKCG_N',
        'Bcatenin_N', 'PSD95_N', 'NR1_N', 'ERK_N', 'pCAMKII_N', 'NR2A_N']], dtype=object)

In [184]:
def plot_expression_values():
    for (i, variable_subset) in enumerate(variable_subsets):
        df = melt_df.loc[melt_df.variable.isin(variable_subset)]
        fig = ff.create_violin(df, data_header='value', group_header='variable',
                           height=500, width=800)
        yield py.iplot(fig, filename='Multiple Violins')



In [204]:
plots = plot_expression_values()

In [205]:
next(plots)

High five! You successfuly sent some data to your account on plotly. View your plot in your browser at https://plot.ly/~podwards_rmit/0 or inside your plot.ly account where it is named 'Multiple Violins'


In [206]:
next(plots)

High five! You successfuly sent some data to your account on plotly. View your plot in your browser at https://plot.ly/~podwards_rmit/0 or inside your plot.ly account where it is named 'Multiple Violins'


In [207]:
next(plots)

High five! You successfuly sent some data to your account on plotly. View your plot in your browser at https://plot.ly/~podwards_rmit/0 or inside your plot.ly account where it is named 'Multiple Violins'


In [208]:
next(plots)

High five! You successfuly sent some data to your account on plotly. View your plot in your browser at https://plot.ly/~podwards_rmit/0 or inside your plot.ly account where it is named 'Multiple Violins'


In [209]:
next(plots)

High five! You successfuly sent some data to your account on plotly. View your plot in your browser at https://plot.ly/~podwards_rmit/0 or inside your plot.ly account where it is named 'Multiple Violins'


In [210]:
next(plots)

High five! You successfuly sent some data to your account on plotly. View your plot in your browser at https://plot.ly/~podwards_rmit/0 or inside your plot.ly account where it is named 'Multiple Violins'


In [211]:
next(plots)

High five! You successfuly sent some data to your account on plotly. View your plot in your browser at https://plot.ly/~podwards_rmit/0 or inside your plot.ly account where it is named 'Multiple Violins'


I'm building this explorer with the hope that it's general enough to be able to take any TDF data and carry out some simple exploration by keeping all the relevant data together, and can be easily used in different workflows.

I want to make it simple to add things to.

In [215]:
class Exploration(object):
    def __init__(self, raw_df, target_cols):
        """
        This constructor really just needs to take in the dataframe, and separate the data into features and targets.
        TODO: what else can go here?
        """
        cols = raw_df.columns

        self.df_class = raw_df[target_cols] # the data frame which consists solely of the target variables
        self.df_attributes = raw_df.drop(target_cols, axis=1) # the data frame which consists soley of the attributes
        

[source](http://scikit-learn.org/stable/modules/preprocessing.html)

4.3.1. Standardization, or mean removal and variance scaling

Standardization of datasets is a common requirement for many machine learning estimators implemented in scikit-learn; they might behave badly if the individual features do not more or less look like standard normally distributed data: Gaussian with zero mean and unit variance.
In practice we often ignore the shape of the distribution and just transform the data to center it by removing the mean value of each feature, then scale it by dividing non-constant features by their standard deviation.
For instance, many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the l1 and l2 regularizers of linear models) assume that all features are centered around zero and have variance in the same order. If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.
The function scale provides a quick and easy way to perform this operation on a single array-like dataset:

In [214]:
from sklearn import preprocessing

In [216]:
X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])

min_max_scaler = preprocessing.MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(X_train)
X_train_minmax


array([[ 0.5       ,  0.        ,  1.        ],
       [ 1.        ,  0.5       ,  0.33333333],
       [ 0.        ,  1.        ,  0.        ]])

In [250]:
class Exploration(object):
    def __init__(self, raw_df, target_cols):
        """
        This constructor really just needs to take in the dataframe, and separate the data into features and targets.
        TODO: what else can go here?
        """
        cols = raw_df.columns

        self.df_class = raw_df[target_cols] # the data frame which consists solely of the target variables
        self.df_attributes = raw_df.drop(target_cols, axis=1) # the data frame which consists soley of the attributes
    def preprocess_scale(self, scaler, **kwargs):
        self.df_attributes[:] = scaler.fit_transform(self.df_attributes[:], **kwargs)
    
class_labels = ['Genotype','Treatment','Behavior','class', 'MouseID']

explorer_1 = Exploration(raw_df, class_labels)
scaler = preprocessing.MinMaxScaler(feature_range=(0,1))
explorer_1.preprocess_scale(scaler)
explorer_1.df_attributes.head()

Unnamed: 0,DYRK1A_N,ITSN1_N,BDNF_N,NR1_N,NR2A_N,pAKT_N,pBRAF_N,pCAMKII_N,pCREB_N,pELK_N,...,SHH_N,BAD_N,BCL2_N,pS6_N,pCFOS_N,SYP_N,H3AcK18_N,EGR1_N,H3MeK4_N,CaNA_N
75,0.595653,0.534663,0.760986,0.65543,0.508589,0.370192,0.329445,0.413858,0.653814,0.217713,...,0.414403,0.200842,0.205061,0.74814,0.364809,0.454037,0.113633,0.10824,0.154429,0.62529
76,0.556332,0.546832,0.715753,0.631176,0.512471,0.440039,0.286874,0.399997,0.559454,0.199869,...,0.460238,0.167748,0.260224,0.829104,0.478722,0.415838,0.0829,0.167738,0.162581,0.576767
77,0.581062,0.55683,0.74711,0.674682,0.535697,0.377134,0.313307,0.428866,0.56602,0.230934,...,0.45493,0.221403,0.315221,0.780527,0.516819,0.405162,0.091838,0.165495,0.152069,0.584052
78,0.509495,0.467473,0.610414,0.533239,0.443286,0.364685,0.255676,0.422615,0.420507,0.189371,...,0.382157,0.166802,0.169442,0.597713,0.297787,0.438416,0.05784,0.097284,0.147637,0.639773
79,0.46903,0.469789,0.614876,0.537198,0.444486,0.384592,0.277065,0.438835,0.421254,0.193969,...,0.437705,0.16505,0.204584,0.709253,0.474846,0.356814,0.077294,0.14495,0.16513,0.610715


In [251]:
explorer_1 = Exploration(raw_df, class_labels)
scaler = preprocessing.StandardScaler()
explorer_1.preprocess_scale(scaler)
explorer_1.df_attributes.head()

Unnamed: 0,DYRK1A_N,ITSN1_N,BDNF_N,NR1_N,NR2A_N,pAKT_N,pBRAF_N,pCAMKII_N,pCREB_N,pELK_N,...,SHH_N,BAD_N,BCL2_N,pS6_N,pCFOS_N,SYP_N,H3AcK18_N,EGR1_N,H3MeK4_N,CaNA_N
75,1.436707,1.126128,1.705383,1.703232,1.499958,-0.450542,0.04755,0.077228,0.862572,0.761816,...,0.415723,-0.575896,-0.681628,0.837643,-0.022167,0.591953,-0.598618,-0.883796,-1.109349,0.764351
76,1.232724,1.198861,1.381104,1.543155,1.528343,-0.028033,-0.303284,0.019442,0.354457,0.502343,...,0.743454,-0.777233,-0.311244,1.331291,0.653422,0.324221,-0.804406,-0.484846,-1.060919,0.564395
77,1.361014,1.25861,1.605906,1.830294,1.698169,-0.408545,-0.085447,0.139804,0.389816,0.95407,...,0.705498,-0.4508,0.058025,1.035111,0.87936,0.249394,-0.744553,-0.499891,-1.123373,0.594418
78,0.989748,0.724574,0.625917,0.89677,1.022459,-0.48385,-0.560389,0.113739,-0.393751,0.349682,...,0.185162,-0.782991,-0.920783,-0.079539,-0.419656,0.482464,-0.972203,-0.957253,-1.149702,0.824032
79,0.779828,0.738418,0.657906,0.9229,1.031234,-0.363433,-0.384122,0.181366,-0.38973,0.416548,...,0.582337,-0.793646,-0.684829,0.600539,0.63043,-0.089468,-0.841942,-0.637644,-1.045776,0.704288


In [301]:
explorer_1 = Exploration(raw_df, class_labels)
scaler = preprocessing.RobustScaler()
explorer_1.preprocess_scale(scaler)
explorer_1.df_attributes.head()

Unnamed: 0,DYRK1A_N,ITSN1_N,BDNF_N,NR1_N,NR2A_N,pAKT_N,pBRAF_N,pCAMKII_N,pCREB_N,pELK_N,...,SHH_N,BAD_N,BCL2_N,pS6_N,pCFOS_N,SYP_N,H3AcK18_N,EGR1_N,H3MeK4_N,CaNA_N
75,1.355087,0.953254,1.432837,1.273848,1.192088,-0.266124,0.014999,0.16954,0.633387,1.039718,...,0.436549,-0.316852,-0.462465,0.599716,0.105433,0.42584,-0.382247,-0.556477,-0.77747,0.495506
76,1.192557,1.006348,1.178596,1.156103,1.213466,0.055942,-0.233701,0.131751,0.278061,0.737196,...,0.709267,-0.485437,-0.11434,0.933204,0.602043,0.222986,-0.584682,-0.229779,-0.737896,0.372148
77,1.294777,1.049963,1.354845,1.367309,1.341366,-0.234112,-0.07928,0.210461,0.302788,1.263868,...,0.677682,-0.212106,0.232738,0.733118,0.768125,0.166292,-0.525804,-0.2421,-0.78893,0.39067
78,0.998958,0.660125,0.586518,0.680651,0.832473,-0.291514,-0.415958,0.193416,-0.245162,0.559208,...,0.24469,-0.490259,-0.687248,-0.019895,-0.186751,0.342883,-0.749745,-0.61663,-0.810445,0.532325
79,0.831697,0.670231,0.611597,0.699871,0.839081,-0.199724,-0.291006,0.23764,-0.242349,0.637167,...,0.575195,-0.49918,-0.465473,0.439539,0.585143,-0.090456,-0.621607,-0.354904,-0.725522,0.458452


In [308]:
import copy

class Exploration(object):
    def __init__(self, raw_df, target_cols):
        """
        This constructor really just needs to take in the dataframe, and separate the data into features and targets.
        TODO: what else can go here?
        """
        cols = raw_df.columns

        self._df_class = raw_df[target_cols] # the data frame which consists solely of the target variables
        self._df_attributes = raw_df.drop(target_cols, axis=1) # the data frame which consists soley of the attributes
        
        self.reset_inputs()
        
    def reset_inputs(self):
        self.df_class = copy.copy(self._df_class)
        self.df_attributes = copy.copy(self._df_attributes)
                
    def preprocess_scale(self, scaler, columns = None, **kwargs):
        """TODO: enforce types"""
        columns = columns if columns else self.df_attributes.columns
        self.df_attributes[columns] = scaler.fit_transform(self.df_attributes[columns], **kwargs)
            
    def preprocess_normalise(self, normaliser, columns = None, **kwargs):
        """TODO: enforce types"""
        columns = columns if columns else self.df_attributes.columns
        self.df_attributes[columns] = scaler.fit_transform(self.df_attributes[columns], **kwargs)
            
    def preprocess(self, function, columns = None):
        """TODO: enforce types"""
        columns = columns if columns else self.df_attributes.columns
        self.df_attributes[columns] = function(self.df_attributes[columns], **kwargs)
        
    def set_class_to_explore(self, key):
            self.target_key = key
        
    def pca(self, n):
        X = self.df_attributes
        pca = decomposition.PCA(n_components=n)
        pca.fit(X)
        X = pca.transform(X)
        self.df_pca = pd.DataFrame(X)
        
    def _cluster_kmeans(self, **kwargs):
        n_clusters = len(set(self.df_class[self.target_key]))
        kmeans = KMeans(n_clusters=n_clusters, **kwargs).fit(self.df_pca)
        self.cluster_results = kmeans
        
    def _cluster_birch(self, **kwargs):
        n_clusters = len(set(self.df_class[self.target_key]))
        birch = Birch(n_clusters=n_clusters, **kwargs).fit(self.df_pca)
        self.cluster_results = birch
        
    def _cluster_spectral(self, **kwargs):
        n_clusters = len(set(self.df_class[self.target_key]))
        spectral = SpectralClustering(n_clusters=n_clusters, **kwargs).fit(self.df_pca)
        self.cluster_results = spectral
        
    def cluster(self, algo=key_kmeans, **kwargs):
        if algo==key_kmeans:
            self._cluster_kmeans(**kwargs)
        elif algo==key_birch:
            self._cluster_birch(**kwargs)
        elif algo==key_spectral:
            self._cluster_spectral(**kwargs)
        
    def pca_scatter_cluster(self, n = 2, algo = key_kmeans, **kwargs):
        self.pca(n)
        self.cluster(algo = algo, **kwargs)
        int_labels = self.cluster_results.labels_
        text_labels = ['Cluster {}'.format(l) for l in int_labels]
        self.df_pca[key_labels] = text_labels
        self.df_pca[key_colour] = int_labels
        if n == 3:
            return scatter_3d(self.df_pca)
        return scatter_2d(self.df_pca)
        
    def pca_scatter_class(self, n = 2):
        self.pca(n)
        class_values = self.df_class[self.target_key]
        text_labels = class_values
        df_colour_dict = dict([(class_label, i) for (i, class_label) in enumerate(set(class_values))])
        class_colours = np.array([df_colour_dict[key] for key in class_values], dtype=int)
        self.df_pca[key_labels] = text_labels
        self.df_pca[key_colour] = class_colours
        if n == 3:
            return scatter_3d(self.df_pca)
        return scatter_2d(self.df_pca)
    
    def compare_class_clusters_violin(self):
        cluster_int_labels = np.array(self.cluster_results.labels_)
        class_values = self.df_class[self.target_key]
        df_colour_dict = dict([(class_label, i) for (i, class_label) in enumerate(set(class_values))])
        class_int_labels = np.array([df_colour_dict[key] for key in class_values], dtype=int)
        df = pd.DataFrame()
        df['Class'] = class_values
        df['Cluster'] = cluster_int_labels
    
        fig = ff.create_violin(df, data_header='Cluster', group_header='Class',
                       height=500, width=800)
        return py.iplot(fig, filename='Multiple Violins')
   
    def compare_class_clusters_scatter(self):
        cluster_int_labels = np.array(self.cluster_results.labels_)
        class_values = self.df_class[self.target_key]
        
        count_dict = defaultdict(lambda : defaultdict(int))
        
        for cluster, class_value in zip(cluster_int_labels, class_values):
            count_dict[cluster][class_value] += 1
            
        df = pd.DataFrame.from_dict(count_dict)
        x = 'level_0'
        y = 'level_1'

        x,y = new_df[x], new_df[y]
        trace1 = go.Scatter(
        x=x,
        y=y,
        mode='markers',
        showlegend=True,
        marker=dict(
            size=new_df[0],
            color=new_df[0],
            colorscale='Jet',
            showscale=True,
            line=dict(
                color=new_df[0],
                width=0.5,
                colorscale='Jet',
            ),

            opacity=1.0
        )
        )

        data = [trace1]
        layout = go.Layout(
            margin=dict(
                l=0,
                r=0,
                b=0,
                t=0
            ),
            xaxis=go.XAxis(
                ticks="",
                showticklabels=True,
                tickvals=list(range(len(df.index))),
                ticktext=df.index,
                tickmode="array"
            )
        )
        fig = go.Figure(data=data, layout=layout)
        return py.iplot(fig, filename='simple-3d-scatter')        
    
   
    
    def classify(self, algo = key_random_forest, train_fraction = 0.9, **kwargs):
        
        classifiers = {
            'kneighbors': KNeighborsClassifier,
            'svc_1': SVC,
            #GaussianProcessClassifier(1.0 * RBF(1.0), warm_start=True),
            'dec_tree': DecisionTreeClassifier,
            key_random_forest: RandomForestClassifier,
            'mlp': MLPClassifier,
            'ada_boost': AdaBoostClassifier,
            'guassian_nb': GaussianNB,
            'quadratic_disc': QuadraticDiscriminantAnalysis
             }
            
        clf = classifiers[algo](**kwargs)
        
        n_train = int(len(self.df_attributes)*train_fraction)
        x_train = self.df_attributes[:n_train]
        y_train = self.df_class[self.target_key][:n_train]
        
        x_test = self.df_attributes[n_train:]
        y_test = self.df_class[self.target_key][n_train:]        
        
        
        clf.fit(x_train, y_train)
        score = clf.score(x_test, y_test)
        return score
        
        
        
def perturb(grid=0.1):
    return np.random.uniform(low=-grid, high=grid)/2

In [309]:
explorer_1 = Exploration(raw_df, class_labels)
normaliser = preprocessing.Normalizer()
explorer_1.preprocess_normalise(normaliser)
explorer_1.df_attributes.head()

Unnamed: 0,DYRK1A_N,ITSN1_N,BDNF_N,NR1_N,NR2A_N,pAKT_N,pBRAF_N,pCAMKII_N,pCREB_N,pELK_N,...,SHH_N,BAD_N,BCL2_N,pS6_N,pCFOS_N,SYP_N,H3AcK18_N,EGR1_N,H3MeK4_N,CaNA_N
75,1.355087,0.953254,1.432837,1.273848,1.192088,-0.266124,0.014999,0.16954,0.633387,1.039718,...,0.436549,-0.316852,-0.462465,0.599716,0.105433,0.42584,-0.382247,-0.556477,-0.77747,0.495506
76,1.192557,1.006348,1.178596,1.156103,1.213466,0.055942,-0.233701,0.131751,0.278061,0.737196,...,0.709267,-0.485437,-0.11434,0.933204,0.602043,0.222986,-0.584682,-0.229779,-0.737896,0.372148
77,1.294777,1.049963,1.354845,1.367309,1.341366,-0.234112,-0.07928,0.210461,0.302788,1.263868,...,0.677682,-0.212106,0.232738,0.733118,0.768125,0.166292,-0.525804,-0.2421,-0.78893,0.39067
78,0.998958,0.660125,0.586518,0.680651,0.832473,-0.291514,-0.415958,0.193416,-0.245162,0.559208,...,0.24469,-0.490259,-0.687248,-0.019895,-0.186751,0.342883,-0.749745,-0.61663,-0.810445,0.532325
79,0.831697,0.670231,0.611597,0.699871,0.839081,-0.199724,-0.291006,0.23764,-0.242349,0.637167,...,0.575195,-0.49918,-0.465473,0.439539,0.585143,-0.090456,-0.621607,-0.354904,-0.725522,0.458452


In [5]:
def scatter_3d(df, x = 0, y = 1, z = 2):
    colors = df.color

    x,y,z = df[x], df[y], df[z]
    trace1 = go.Scatter3d(
        x=x,
        y=y,
        z=z,
        mode='markers',
        #text=whole_df['artist_name'],
        #name=whole_df['artist_name'],
        showlegend=True,
        marker=dict(
            size=10,
            color=colors,
            colorscale='Jet',
            showscale=True,
            line=dict(
                color=colors,
                width=0.5,
                colorscale='Jet',
            ),

            opacity=1.0
        )
    )

    data = [trace1]
    layout = go.Layout(
        margin=dict(
            l=0,
            r=0,
            b=0,
            t=0
        )
    )
    fig = go.Figure(data=data, layout=layout)
    return py.iplot(fig, filename='simple-3d-scatter')

def scatter_2d(df, x = 0, y = 1):
    colors = df.color

    x,y = df[x], df[y]
    trace1 = go.Scatter(
        x=x,
        y=y,
        mode='markers',
        #text=whole_df['artist_name'],
        #name=whole_df['artist_name'],
        showlegend=True,
        marker=dict(
            size=10,
            color=colors,
            colorscale='Jet',
            showscale=True,
            line=dict(
                color=colors,
                width=0.5,
                colorscale='Jet',
            ),

            opacity=1.0
        )
    )

    data = [trace1]
    layout = go.Layout(
        margin=dict(
            l=0,
            r=0,
            b=0,
            t=0
        )
    )
    fig = go.Figure(data=data, layout=layout)
    return py.iplot(fig, filename='simple-3d-scatter')

In [292]:
key_kmeans = 'kmeans'
key_birch = 'birch'
key_spectral = 'spectral'
key_colour = 'color'
key_labels = 'labels'
key_random_forest = 'random_forest'

class Exploration(object):
    def __init__(self, raw_df, target_cols):
        """
        This constructor really just needs to take in the dataframe, and separate the data into features and targets.
        TODO: what else can go here?
        """
        cols = raw_df.columns

        self.df_class = raw_df[target_cols] # the data frame which consists solely of the target variables
        self.df_attributes = raw_df.drop(target_cols, axis=1) # the data frame which consists soley of the attributes
    def preprocess_scale(self, scaler, **kwargs):
        self.df_attributes[:] = scaler.fit_transform(self.df_attributes[:], **kwargs)
    def preprocess_normalize(self):
        return
    def preprocess_other(self):
        return
    def set_class_to_explore(self, key):
        self.target_key = key
        
    def pca(self, n):
        X = self.df_attributes
        pca = decomposition.PCA(n_components=n)
        pca.fit(X)
        X = pca.transform(X)
        self.df_pca = pd.DataFrame(X)
        
    def _cluster_kmeans(self, **kwargs):
        n_clusters = len(set(self.df_class[self.target_key]))
        kmeans = KMeans(n_clusters=n_clusters, **kwargs).fit(self.df_pca)
        self.cluster_results = kmeans
        
    def _cluster_birch(self, **kwargs):
        n_clusters = len(set(self.df_class[self.target_key]))
        birch = Birch(n_clusters=n_clusters, **kwargs).fit(self.df_pca)
        self.cluster_results = birch
        
    def _cluster_spectral(self, **kwargs):
        n_clusters = len(set(self.df_class[self.target_key]))
        spectral = SpectralClustering(n_clusters=n_clusters, **kwargs).fit(self.df_pca)
        self.cluster_results = spectral
        
    def cluster(self, algo=key_kmeans, **kwargs):
        if algo==key_kmeans:
            self._cluster_kmeans(**kwargs)
        elif algo==key_birch:
            self._cluster_birch(**kwargs)
        elif algo==key_spectral:
            self._cluster_spectral(**kwargs)
        
    def pca_scatter_cluster(self, n = 2, algo = key_kmeans, **kwargs):
        self.pca(n)
        self.cluster(algo = algo, **kwargs)
        int_labels = self.cluster_results.labels_
        text_labels = ['Cluster {}'.format(l) for l in int_labels]
        self.df_pca[key_labels] = text_labels
        self.df_pca[key_colour] = int_labels
        if n == 3:
            return scatter_3d(self.df_pca)
        return scatter_2d(self.df_pca)
        
    def pca_scatter_class(self, n = 2):
        self.pca(n)
        class_values = self.df_class[self.target_key]
        text_labels = class_values
        df_colour_dict = dict([(class_label, i) for (i, class_label) in enumerate(set(class_values))])
        class_colours = np.array([df_colour_dict[key] for key in class_values], dtype=int)
        self.df_pca[key_labels] = text_labels
        self.df_pca[key_colour] = class_colours
        if n == 3:
            return scatter_3d(self.df_pca)
        return scatter_2d(self.df_pca)
    
    def compare_class_clusters_violin(self):
        cluster_int_labels = np.array(self.cluster_results.labels_)
        class_values = self.df_class[self.target_key]
        df_colour_dict = dict([(class_label, i) for (i, class_label) in enumerate(set(class_values))])
        class_int_labels = np.array([df_colour_dict[key] for key in class_values], dtype=int)
        df = pd.DataFrame()
        df['Class'] = class_values
        df['Cluster'] = cluster_int_labels
    
        fig = ff.create_violin(df, data_header='Cluster', group_header='Class',
                       height=500, width=800)
        return py.iplot(fig, filename='Multiple Violins')
   
    def compare_class_clusters_scatter(self):
        cluster_int_labels = np.array(self.cluster_results.labels_)
        class_values = self.df_class[self.target_key]
        
        count_dict = defaultdict(lambda : defaultdict(int))
        
        for cluster, class_value in zip(cluster_int_labels, class_values):
            count_dict[cluster][class_value] += 1
            
        df = pd.DataFrame.from_dict(count_dict)
        x = 'level_0'
        y = 'level_1'

        x,y = new_df[x], new_df[y]
        trace1 = go.Scatter(
        x=x,
        y=y,
        mode='markers',
        showlegend=True,
        marker=dict(
            size=new_df[0],
            color=new_df[0],
            colorscale='Jet',
            showscale=True,
            line=dict(
                color=new_df[0],
                width=0.5,
                colorscale='Jet',
            ),

            opacity=1.0
        )
        )

        data = [trace1]
        layout = go.Layout(
            margin=dict(
                l=0,
                r=0,
                b=0,
                t=0
            ),
            xaxis=go.XAxis(
                ticks="",
                showticklabels=True,
                tickvals=list(range(len(df.index))),
                ticktext=df.index,
                tickmode="array"
            )
        )
        fig = go.Figure(data=data, layout=layout)
        return py.iplot(fig, filename='simple-3d-scatter')        
    
   
    
    def classify(self, algo = key_random_forest, train_fraction = 0.9, **kwargs):
        
        classifiers = {
            'kneighbors': KNeighborsClassifier,
            'svc_1': SVC,
            #GaussianProcessClassifier(1.0 * RBF(1.0), warm_start=True),
            'dec_tree': DecisionTreeClassifier,
            key_random_forest: RandomForestClassifier,
            'mlp': MLPClassifier,
            'ada_boost': AdaBoostClassifier,
            'guassian_nb': GaussianNB,
            'quadratic_disc': QuadraticDiscriminantAnalysis
             }
            
        clf = classifiers[algo](**kwargs)
        
        n_train = int(len(self.df_attributes)*train_fraction)
        x_train = self.df_attributes[:n_train]
        y_train = self.df_class[self.target_key][:n_train]
        
        x_test = self.df_attributes[n_train:]
        y_test = self.df_class[self.target_key][n_train:]        
        
        
        clf.fit(x_train, y_train)
        score = clf.score(x_test, y_test)
        return score
        
        
        
def perturb(grid=0.1):
    return np.random.uniform(low=-grid, high=grid)/2




In [212]:
raw_df.head()

Unnamed: 0,MouseID,DYRK1A_N,ITSN1_N,BDNF_N,NR1_N,NR2A_N,pAKT_N,pBRAF_N,pCAMKII_N,pCREB_N,...,pCFOS_N,SYP_N,H3AcK18_N,EGR1_N,H3MeK4_N,CaNA_N,Genotype,Treatment,Behavior,class
75,3415_1,0.649781,0.828696,0.405862,2.921435,5.167979,0.207174,0.17664,3.728084,0.239283,...,0.129363,0.486912,0.125152,0.146865,0.143517,1.627181,Control,Memantine,C/S,c-CS-m
76,3415_2,0.616481,0.841974,0.388584,2.862575,5.194163,0.223433,0.167725,3.64824,0.22103,...,0.143084,0.467833,0.112857,0.161132,0.145719,1.562096,Control,Memantine,C/S,c-CS-m
77,3415_3,0.637424,0.852882,0.400561,2.968155,5.35082,0.20879,0.173261,3.814545,0.2223,...,0.147673,0.462501,0.116433,0.160594,0.142879,1.571868,Control,Memantine,C/S,c-CS-m
78,3415_4,0.576815,0.75539,0.348346,2.624901,4.727509,0.205892,0.161192,3.77853,0.194153,...,0.12129,0.47911,0.102831,0.144238,0.141681,1.646608,Control,Memantine,C/S,c-CS-m
79,3415_5,0.542545,0.757917,0.350051,2.634509,4.735602,0.210526,0.165671,3.871971,0.194297,...,0.142617,0.438354,0.110614,0.155667,0.146408,1.607631,Control,Memantine,C/S,c-CS-m


In [293]:
class_labels = ['Genotype','Treatment','Behavior','class', 'MouseID']

explorer_1 = Exploration(raw_df, class_labels)
explorer_1.set_class_to_explore('class')
#scaler = preprocessing.RobustScaler()
scaler = preprocessing.StandardScaler()
explorer_1.preprocess_scale(scaler)

In [310]:
explorer_1.classify(key_random_forest, n_estimators=1000, max_features=77)

AttributeError: 'Exploration' object has no attribute 'classify'

In [270]:
explorer_1.classify('kneighbors')

0.025362318840579712

In [244]:
explorer_1.classify('svc_1')

0.050724637681159424

In [245]:
explorer_1.classify('svc_2')

0.0

In [246]:
explorer_1.classify('dec_tree')

0.079710144927536225

In [247]:
explorer_1.classify('mlp')


Stochastic Optimizer: Maximum iterations reached and the optimization hasn't converged yet.



0.032608695652173912

In [249]:
explorer_1.classify('quadratic_disc')


Variables are collinear



0.068840579710144928

In [53]:
new_df = df.unstack(level=1).reset_index()
new_df

Unnamed: 0,level_0,level_1,0
0,0,c-CS-m,16.0
1,0,c-CS-s,
2,0,c-SC-m,31.0
3,0,c-SC-s,12.0
4,0,t-CS-m,23.0
5,0,t-CS-s,12.0
6,0,t-SC-m,31.0
7,0,t-SC-s,21.0
8,1,c-CS-m,1.0
9,1,c-CS-s,35.0


There are several class labels, as well as `MouseID` that I don't want included as factors in a model.

In [40]:
class_labels = ['Genotype','Treatment','Behavior','class', 'MouseID']

explorer_1 = Exploration(raw_df, class_labels)

This exploration uses the birch clustering algo, and will label the PCA scatter based on the the `class` attribute.

In [41]:
explorer_1.set_class_to_explore('class')

The natural clustering of the classes appears to be more strand like than clumpy

In [42]:
explorer_1.pca_scatter_class(n=3)

In [43]:
explorer_1.pca_scatter_cluster(n=3, algo=key_birch)

In [35]:
explorer_1.compare_class_clusters_scatter()

High five! You successfuly sent some data to your account on plotly. View your plot in your browser at https://plot.ly/~podwards_rmit/0 or inside your plot.ly account where it is named 'Multiple Violins'


In [31]:
explorer_1.pca_scatter_cluster(n=3, algo=key_kmeans)

In [32]:
explorer_1.compare_class_clusters()

High five! You successfuly sent some data to your account on plotly. View your plot in your browser at https://plot.ly/~podwards_rmit/0 or inside your plot.ly account where it is named 'Multiple Violins'


Looking at this [comparison](http://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html#sphx-glr-auto-examples-cluster-plot-cluster-comparison-py) leads me to think that maybe spectral clustering is better for this data set.

In [29]:
explorer_1.pca_scatter_cluster(n=3, algo=key_spectral, affinity="nearest_neighbors")

In [30]:
explorer_1.compare_class_clusters()

High five! You successfuly sent some data to your account on plotly. View your plot in your browser at https://plot.ly/~podwards_rmit/0 or inside your plot.ly account where it is named 'Multiple Violins'
