<a href="https://colab.research.google.com/github/AvantiShri/oceanography_colab_notebooks/blob/master/for_clkelly/Colette_N2O_Data_withMeanImputation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
#for leiden community detection
!pip install leidenalg

Collecting leidenalg
  Downloading https://files.pythonhosted.org/packages/7e/68/01da5910be71e4fd6f96af7c3c0f31f531c96300bbe50b418c0b5a3eaeb6/leidenalg-0.8.1-cp36-cp36m-manylinux2010_x86_64.whl (2.4MB)
Collecting python-igraph>=0.8.0
  Downloading https://files.pythonhosted.org/packages/8b/74/24a1afbf3abaf1d5f393b668192888d04091d1a6d106319661cd4af05406/python_igraph-0.8.2-cp36-cp36m-manylinux2010_x86_64.whl (3.2MB)
Collecting texttable>=1.6.2
  Downloading https://files.pythonhosted.org/packages/ec/b1/8a1c659ce288bf771d5b1c7cae318ada466f73bd0e16df8d86f27a2a3ee7/texttable-1.6.2-py2.py3-none-any.whl
Installing collected packages: texttable, python-igraph, leidenalg
Successfully installed leidenalg-0.8.1 python-igraph-0.8.2 texttable-1.6.2


Grab the data

Note: if you want to replace this with your own file, you can either:
1. Bypass this wget command entirely and just upload the file to Colab using the panel on the left, or
2. If you want to use the wget command, you need to place your file in Google Drive and then change the link sharing setting to allow anyone with the link to view it. Then, from the link sharing URL, get the file "id" and paste that file id into "id=..." in the command below. Also change "-O ..." to have your desired output file name.
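
If you go with option 2, the file id can be pulled out of the sharing link by hand, or with a small helper like the sketch below. The helper name `drive_file_id` and the shape of the sharing URL (`.../file/d/<id>/view?...`) are assumptions made for illustration; only the download URL pattern is taken from the wget command in this notebook.

```python
import re

# Download URL pattern used by the wget command in this notebook
DOWNLOAD_TEMPLATE = "https://docs.google.com/uc?export=download&id={file_id}"

def drive_file_id(share_url):
    """Return the file id found in a Drive sharing URL, or None.

    Hypothetical helper: assumes the link looks like
    .../file/d/<id>/view?... or carries an ...id=<id> query parameter.
    """
    match = re.search(r"/file/d/([A-Za-z0-9_-]+)", share_url)
    if match:
        return match.group(1)
    match = re.search(r"[?&]id=([A-Za-z0-9_-]+)", share_url)
    return match.group(1) if match else None

# Illustrative sharing link built around the file id used in this notebook
example_share_url = ("https://drive.google.com/file/d/"
                     "1lzNG3-ClWIKWwTska9OPBaUb1o8uPA0h/view?usp=sharing")
file_id = drive_file_id(example_share_url)
download_url = DOWNLOAD_TEMPLATE.format(file_id=file_id)
print(download_url)
```

The printed URL is what goes inside the quotes of the wget command.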

In [2]:
!wget --no-check-certificate 'https://docs.google.com/uc?export=download&id=1lzNG3-ClWIKWwTska9OPBaUb1o8uPA0h' -O 200413_nitrous_oxide_cycling_regimes_data_for_repositories.csv

--2020-08-25 18:04:41--  https://docs.google.com/uc?export=download&id=1lzNG3-ClWIKWwTska9OPBaUb1o8uPA0h
Resolving docs.google.com (docs.google.com)... 74.125.141.102, 74.125.141.138, 74.125.141.100, ...
Connecting to docs.google.com (docs.google.com)|74.125.141.102|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://doc-08-50-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/m0cles8o6f84kg4n2fip56hdrgpbe9vb/1598378625000/00395683668588961264/*/1lzNG3-ClWIKWwTska9OPBaUb1o8uPA0h?e=download [following]
--2020-08-25 18:04:41--  https://doc-08-50-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/m0cles8o6f84kg4n2fip56hdrgpbe9vb/1598378625000/00395683668588961264/*/1lzNG3-ClWIKWwTska9OPBaUb1o8uPA0h?e=download
Resolving doc-08-50-docs.googleusercontent.com (doc-08-50-docs.googleusercontent.com)... 172.217.204.132, 2607:f8b0:400c:c15::84
Connecting to doc-08-50-docs.googleusercontent.com (doc-08-50-

From Colette's email

```
So this whole thing started with plots of N2O isotopomers (columns "d15N-N2Oa_mean (per mil vs. atm N2)", "d15N-N2Ob_mean 
(per mil vs. atm. N2)", and "d18O-N2O_mean (per mil vs. VSMOW)") vs. the inverse of N2O concentration (1/"N2O_mean (nM)"). They are 
a figure in my paper. These plots had a visible change point in them. Patrick has noticed a similar phenomenon in his data. The two 
clusters on a plot like this indicate two different pools of N2O produced from two different sources.

I strongly suspect that nitrite concentration ("Nitrite [uM]" in the spreadsheet) and oxygen ("Seabird Oxygen [umol/L]") also inform the clustering. 
Also the isotopes of nitrite and nitrate ("d18O-NO3 avg (per mil vs. VSMOW)" and so forth). Furthermore I feel like the degree to which they inform 
clustering actually gives us additional information as well. For example, if nitrite concentration is a strong predictor whether a datapoint falls into one 
cluster or another, that tells me that nitrite is likely a substrate for one of these N2O pools.

If we could define a relationship between [N2O] and d18O-N2O, controlling for d15N-N2Oa, that could be interesting. From the rudimentary version of this 
clustering stuff in my paper, we see that d15N-N2Oa looks like it could be an N2O consumption signal. But d18O-N2O does not — or rather, d18O-N2O is more of 
a net production + consumption signal. In reductive waters, d18O-N2O and d15N-N2Oa are both tightly controlled by N2O consumption and thus are very well 
correlated. In these plots, we are making the assumption that these are NOT reductive waters, so it would be interesting to see if these two factors have 
relationships with [N2O] that are independent of each other.
```
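
To make the reasoning behind the 1/[N2O] axis concrete, here is a minimal sketch of two-end-member mixing, assuming delta values mix approximately linearly by mass. All numbers (`background_conc`, `background_delta`, `source_delta`) are hypothetical and chosen only for illustration; they are not taken from Colette's data. Under that assumption, adding N2O from a single source to a fixed background puts the samples on a straight line in delta vs. 1/[N2O] space, and the intercept of that line is the delta of the added source. Two sources would then give two lines, which may be what shows up as two clusters.

```python
import numpy as np

# Hypothetical end members, for illustration only
background_conc = 10.0    # nM of N2O already present
background_delta = 5.0    # per mil
source_delta = 40.0       # per mil, delta of the N2O being added

# Add increasing amounts of source N2O to the same background
added_conc = np.linspace(0.0, 50.0, 11)
mix_conc = background_conc + added_conc
# Mass-weighted mixing of the delta values (an approximation)
mix_delta = (background_conc * background_delta
             + added_conc * source_delta) / mix_conc

# Straight-line fit of delta against 1/[N2O]
slope, intercept = np.polyfit(1.0 / mix_conc, mix_delta, deg=1)
print("intercept:", intercept, "slope:", slope)
```

In this toy setup the fitted intercept recovers `source_delta`, and the slope equals `(background_delta - source_delta) * background_conc`.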

Read in the data frame and rename the columns

In [3]:
from matplotlib import pyplot as plt
import numpy as np
import pandas

#Easy remapping of the column names
#The original column name goes on the RIGHT, the desired new column name
# goes on the LEFT.
#Av's note: given that I encountered problems with altair when using
# the original column names from Colette's data (specifically, the altair
# plots would show up blank), I think it is a good idea to make sure your
# new column names avoid fancy characters. If you observe that your altair
# plots turn up blank, try changing the column names.
colnames_map = {'d15N_N2Oa_mean':"d15N-N2Oa_mean (per mil vs. atm N2)",
            'd15N_N2Ob_mean':"d15N-N2Ob_mean (per mil vs. atm. N2)",
            'd18O_N2O_mean':"d18O-N2O_mean (per mil vs. VSMOW)",
            'N2O_mean':"N2O_mean (nM)",
            'd18O_NO3_mean':'d18O-NO3 avg (per mil vs. VSMOW)',
            'd15N_NO3_mean':'d15N-NO3 avg (per mil vs. atm. N2)',
            'd15N_NO2': 'd15N-NO2 (per mil vs. atm N2)',
            'd18O_NO2': 'd18O-NO2 (per mil vs. VSMOW)',
            'Nitrite':"Nitrite [uM]",
            'Oxygen':"Seabird Oxygen [umol/L]",
            'NO3_mean':'NO3_mean (uM)',
            'Depth': 'Target Depth [m]'}

def remap_colnames(df, colnames_map):
  foraltair_df = pandas.DataFrame(dict([
      (new_col, np.array(df[orig_col]))
      for new_col,orig_col in colnames_map.items()]))
  return foraltair_df

df = pandas.read_csv("200413_nitrous_oxide_cycling_regimes_data_for_repositories.csv")
#As mentioned above, altair chokes when provided data frames with some
# of the original column names. So I am remapping the column names.
foraltair_df = remap_colnames(df=df, colnames_map=colnames_map)
#create a column for the inverse of the N2O mean
foraltair_df['inv_N2O_mean'] = 1/foraltair_df['N2O_mean']

Prepare the features for clustering (standardize + impute missing values)

In [4]:
import sklearn.impute

#the columns to use for clustering (these should be in terms of your new
# column names)
columns_to_compare = [
      'd15N_N2Oa_mean', 'd15N_N2Ob_mean', 'd18O_N2O_mean',
      'N2O_mean', 'd18O_NO3_mean', 'd15N_NO3_mean',
      'd15N_NO2', 'd18O_NO2', 'Nitrite',
      'Oxygen', 'NO3_mean', 'Depth']

#for clustering purposes, we standardize each column by subtracting mean and
# dividing by standard deviation
forclustering_df = pandas.DataFrame()
for colname in columns_to_compare:
  vals = np.array(foraltair_df[colname])
  #use nanmean and nanstd to ignore nan values for now
  forclustering_df['zscore_'+colname] = (vals-np.nanmean(vals))/np.nanstd(vals)

#we impute nan values using KNNImputer
forclustering_df = pandas.DataFrame(data=sklearn.impute.KNNImputer(
    missing_values=np.nan, n_neighbors=5,
    weights='distance').fit_transform(forclustering_df),
    columns=forclustering_df.columns)

#prepare a 'features' matrix for each point
features = np.array([np.array(forclustering_df["zscore_"+col])
                     for col in columns_to_compare]).transpose((1,0))
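
To see what the KNN imputation step above does, here is a small sketch on a toy matrix. The values in `X` are made up for illustration, and `n_neighbors=2` is used only because the toy matrix is tiny (the cell above uses 5). Each missing entry is filled from the rows that are closest on the columns both rows have observed, so a gap in the first group of rows is filled with values from that group rather than from the whole column.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix (hypothetical values): rows 0-2 form one group, rows 3-4 another
X = np.array([
    [1.0, 2.0, np.nan],
    [1.1, 2.1, 3.0],
    [0.9, 1.9, 2.8],
    [5.0, 6.0, 7.0],
    [5.1, np.nan, 7.2],
])

imputer = KNNImputer(missing_values=np.nan, n_neighbors=2,
                     weights='distance')
X_filled = imputer.fit_transform(X)

# Entries that were observed are left untouched; only the gaps are filled
observed = ~np.isnan(X)
print(X_filled)
```

The filled value in row 0 comes from rows 1 and 2, its nearest neighbors, so it lands between 2.8 and 3.0 instead of near the column mean of 5.0.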

In [8]:
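#For contrast, repeat the preprocessing with plain mean imputation instead
# of KNN imputation.
#Note: this cell calls run_leiden_using_tsneadapted_distances and uses
# sklearn.manifold, both of which come from the clustering cell (In [5])
# further down, so that cell must be run first.
#As before, we standardize each column by subtracting mean and
# dividing by standard deviation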
meanfill_forclustering_df = pandas.DataFrame()
for colname in columns_to_compare:
  vals = np.array(foraltair_df[colname])
  #use nanmean and nanstd to ignore nan values for now
  meanfill_forclustering_df['zscore_'+colname] = (vals-np.nanmean(vals))/np.nanstd(vals)

#fill missing values with mean
meanfill_forclustering_df = meanfill_forclustering_df.fillna(
    meanfill_forclustering_df.mean()) 

#meanfill features
meanfill_features = np.array([np.array(meanfill_forclustering_df["zscore_"+col])
                     for col in columns_to_compare]).transpose((1,0))
#Get Leiden communities using t-sne derived distances - meanfill
PERPLEXITY = 20
meanfill_leiden_clusters = run_leiden_using_tsneadapted_distances(
    features=meanfill_features, perplexity=PERPLEXITY)

#derive t-sne embedding given the features
meanfill_embedding = sklearn.manifold.TSNE(perplexity=PERPLEXITY,
                                  random_state=123).fit_transform(meanfill_features)
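
A note on what mean filling amounts to here: because each column is z-scored first, the column mean is (numerically) zero, so every missing entry is placed at the center of that feature. The sketch below shows this on a single made-up column; `vals` is hypothetical data, not a column from the spreadsheet.

```python
import numpy as np

# One hypothetical column with a missing value
vals = np.array([2.0, 4.0, np.nan, 6.0, 8.0])

# Same standardization as in the cell above, ignoring nan
z = (vals - np.nanmean(vals)) / np.nanstd(vals)

# Mean fill: replace nan with the mean of the standardized column
filled = np.where(np.isnan(z), np.nanmean(z), z)
print(filled)
```

The filled entry comes out at (approximately) 0 regardless of what the other features of that sample look like, which is the behavior the KNN imputation is meant to improve on.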


Run clustering + compute lower-dimensional t-sne visualization

In [5]:
import leidenalg
import scipy
import sklearn.manifold


#From: https://github.com/theislab/scanpy/blob/8131b05b7a8729eae3d3a5e146292f377dd736f7/scanpy/_utils.py#L159
def get_igraph_from_adjacency(adjacency, directed=None):
    """Get igraph graph from adjacency matrix."""
    import igraph as ig
    sources, targets = adjacency.nonzero()
    weights = adjacency[sources, targets]
    if isinstance(weights, np.matrix):
        weights = weights.A1
    g = ig.Graph(directed=directed)
    g.add_vertices(adjacency.shape[0])  # this adds adjacency.shape[0] vertices
    g.add_edges(list(zip(sources, targets)))
    try:
        g.es['weight'] = weights
    except:
        pass
    if g.vcount() != adjacency.shape[0]:
        print('WARNING: The constructed graph has only '
              +str(g.vcount())+' nodes. '
             'Your adjacency matrix contained redundant nodes.')
    return g


def run_leiden_community_detection(affinity_matrix, seed):
  the_graph = get_igraph_from_adjacency(affinity_matrix)
  partition = leidenalg.find_partition(
                    the_graph, leidenalg.ModularityVertexPartition,
                    weights=(np.array(the_graph.es['weight'])
                             .astype(np.float64)),
                    n_iterations=-1,
                    seed=seed)
  return partition


def run_leiden_with_multiple_seeds_and_take_best(affinity_matrix, num_seeds):
  best_quality = None
  for seedidx in range(num_seeds):
    partition = run_leiden_community_detection(affinity_matrix, seedidx*100)
    quality = partition.quality()
    if ((best_quality is None) or (quality > best_quality)):
        best_quality = quality
        best_clustering = np.array(partition.membership)
  return best_clustering


def run_leiden_using_tsneadapted_distances(features, perplexity):
  pairwise_distances = scipy.spatial.distance.squareform(
      scipy.spatial.distance.pdist(X=features))
  affmat = sklearn.manifold._utils._binary_search_perplexity(
                pairwise_distances.astype("float32"), perplexity, False)
  #symmetrize affinity matrix by addition
  affmat = affmat + affmat.T
  #run Leiden with 3 different seeds and take the partition with the
  # best quality (modularity)
  leiden_clusters = run_leiden_with_multiple_seeds_and_take_best(
      affinity_matrix=affmat, num_seeds=3)
  return leiden_clusters


#Get Leiden communities using t-sne derived distances
PERPLEXITY = 20
leiden_clusters = run_leiden_using_tsneadapted_distances(
    features=features, perplexity=PERPLEXITY)

#derive t-sne embedding given the features
embedding = sklearn.manifold.TSNE(perplexity=PERPLEXITY,
                                  random_state=123).fit_transform(features)

#Store the results of the clustering and the embedding in the data frame
foraltair_df['tsne_axis1'] = embedding[:,0]
foraltair_df['tsne_axis2'] = embedding[:,1]
#I am storing the clusters as strings so they automatically get
# interpreted as categorical
foraltair_df['clusters'] = [str(x) for x in leiden_clusters]
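
For readers wondering what the affinity matrix in the cell above represents: `_binary_search_perplexity` is a private sklearn routine, and the sketch below is a simplified numpy illustration of the idea, not sklearn's implementation. For each point it searches for a bandwidth `beta` such that the conditional distribution over the other points has the requested perplexity, and the resulting matrix is then symmetrized by addition as in the notebook. The function names, tolerance, and the random toy points are all assumptions made for this example.

```python
import numpy as np

def conditional_probs(dist_row, beta):
    """p(j|i) proportional to exp(-beta * d_ij), normalized to sum to 1."""
    # subtracting the minimum distance avoids underflow; it cancels out
    # in the normalization
    p = np.exp(-beta * (dist_row - dist_row.min()))
    return p / p.sum()

def perplexity(p):
    """exp of the Shannon entropy (natural log) of a distribution."""
    nz = p[p > 0]
    return np.exp(-np.sum(nz * np.log(nz)))

def search_beta(dist_row, target_perplexity, tol=1e-5, max_iter=200):
    """Bisection on beta until the row's perplexity matches the target."""
    lo, hi = 0.0, np.inf
    beta = 1.0
    for _ in range(max_iter):
        p = conditional_probs(dist_row, beta)
        diff = perplexity(p) - target_perplexity
        if abs(diff) < tol:
            break
        if diff > 0:   # distribution too flat: sharpen it
            lo = beta
            beta = beta * 2.0 if np.isinf(hi) else (lo + hi) / 2.0
        else:          # distribution too peaked: flatten it
            hi = beta
            beta = (lo + hi) / 2.0
    return conditional_probs(dist_row, beta)

# Toy data: 30 random points in 3 dimensions (illustration only)
rng = np.random.default_rng(0)
points = rng.normal(size=(30, 3))
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

TARGET_PERPLEXITY = 5.0
n = points.shape[0]
P = np.zeros((n, n))
for i in range(n):
    others = np.arange(n) != i   # a point is not its own neighbor
    P[i, others] = search_beta(dist[i, others], TARGET_PERPLEXITY)

# symmetrize by addition, as done in the notebook
affinity = P + P.T
```

The perplexity acts as a soft neighborhood size: each row of `P` spreads its probability mass over roughly `TARGET_PERPLEXITY` neighbors, which is why the same `PERPLEXITY` value is passed to both the clustering and the t-sne embedding.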

In [10]:
foraltair_df['meanfill_tsne_axis1'] = meanfill_embedding[:,0]
foraltair_df['meanfill_tsne_axis2'] = meanfill_embedding[:,1]
foraltair_df['meanfill_clusters'] = [str(x) for x in meanfill_leiden_clusters]
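
One possible way to compare the two sets of cluster labels, beyond looking at the plots, is a contingency table. The sketch below uses a made-up `toy` data frame with the same two column names; with the real data the same call could be applied to `foraltair_df['clusters']` and `foraltair_df['meanfill_clusters']`. Keep in mind that cluster ids are arbitrary, so cluster "0" in one run need not correspond to cluster "0" in the other; what matters is how the rows and columns of the table overlap.

```python
import pandas

# Hypothetical labels, for illustration only
toy = pandas.DataFrame({
    'clusters':          ['0', '0', '1', '1', '2', '2'],
    'meanfill_clusters': ['0', '0', '0', '1', '1', '1'],
})

# Rows: KNN-imputed clusters; columns: mean-imputed clusters
table = pandas.crosstab(toy['clusters'], toy['meanfill_clusters'])
print(table)
```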

View altair interactive visualizations


In [13]:
import altair as alt

DF_TO_USE = foraltair_df
INTERVAL_SELECTION = alt.selection_interval()
LEGEND_SELECTION = alt.selection_multi(fields=['clusters'])
COMPOSED_SELECTION = (INTERVAL_SELECTION | LEGEND_SELECTION)
TOTAL_WIDTH=1200
TOTAL_HEIGHT=680
TSNE_HEIGHTFRAC=0.4
TSNE_WIDTHFRAC=0.2
FONTSIZE=10
PADDING_GUESS=45 #additional padding to subtract off

def get_interactive_histogram(colname):
  yaxis = alt.Y('count():Q', title="Count")
  xaxis = alt.X(colname+':Q', bin=alt.Bin(maxbins=100))
  #apparently height/width doesn't include the space for the
  # axes labels, so these need to be adjusted a bit.
  bg_histogram = alt.Chart(DF_TO_USE).mark_bar().encode(
                    y=yaxis,
                    x=xaxis,
                    color=alt.value('lightgrey')).properties(
                      width=TOTAL_WIDTH*(1-TSNE_WIDTHFRAC)/4
                            - (FONTSIZE+PADDING_GUESS),
                      height=TOTAL_HEIGHT*TSNE_HEIGHTFRAC/3
                            - (FONTSIZE+PADDING_GUESS),
                      selection=INTERVAL_SELECTION)
  fg_histogram = alt.Chart(DF_TO_USE).mark_bar().encode(
                      y=yaxis,
                      color=alt.value('steelblue'),
                      x=xaxis).transform_filter(COMPOSED_SELECTION)
  return (bg_histogram+fg_histogram)

#define the color property that will be shared for the scatterplots/legend
color = alt.condition(COMPOSED_SELECTION, 'clusters', alt.value('lightgray'),
                      scale=alt.Scale(scheme='category10'),
                      legend=None)

#base chart for t-sne scatterplot
tsne_base = alt.Chart(DF_TO_USE).mark_point().encode(
  color=color
).properties(width=TOTAL_WIDTH*TSNE_WIDTHFRAC - (FONTSIZE+PADDING_GUESS),
             height=TOTAL_HEIGHT*TSNE_HEIGHTFRAC - (FONTSIZE+PADDING_GUESS)
             ).add_selection(INTERVAL_SELECTION)

#base chart for all other scatterplots
base = alt.Chart(DF_TO_USE).mark_point().encode(
  color=color
).properties(width=TOTAL_WIDTH/4 - (FONTSIZE+PADDING_GUESS),
             height=(TOTAL_HEIGHT*(1-TSNE_HEIGHTFRAC))/2 
                     - (FONTSIZE+PADDING_GUESS)).add_selection(
                         INTERVAL_SELECTION)
#selectable legend
legend = alt.Chart(DF_TO_USE).mark_point().encode(
            y=alt.Y('clusters:N', axis=alt.Axis(orient='right')),
            color=color
        ).add_selection(LEGEND_SELECTION)

#compose the whole layout
alt.vconcat(
    
(tsne_base.encode(x='tsne_axis1', y='tsne_axis2')
| tsne_base.encode(x='meanfill_tsne_axis1', y='meanfill_tsne_axis2')
| alt.vconcat(get_interactive_histogram('Depth'),
             get_interactive_histogram('Oxygen'),
             get_interactive_histogram('inv_N2O_mean'))
| alt.vconcat(get_interactive_histogram('d15N_N2Oa_mean'),
              get_interactive_histogram('d15N_N2Ob_mean'),
              get_interactive_histogram('d18O_N2O_mean'))
| alt.vconcat(get_interactive_histogram('NO3_mean'),
              get_interactive_histogram('d15N_NO3_mean'),
              get_interactive_histogram('d18O_NO3_mean'))
| alt.vconcat(get_interactive_histogram('Nitrite'),
              get_interactive_histogram('d15N_NO2'),
              get_interactive_histogram('d18O_NO2'))
| legend
),

(base.encode(x='inv_N2O_mean', y='d15N_N2Oa_mean')
| base.encode(x='inv_N2O_mean', y='d15N_N2Ob_mean')
| base.encode(x='inv_N2O_mean', y='d18O_N2O_mean')
| base.encode(x='d15N_N2Oa_mean', y='d18O_N2O_mean')
),

(base.encode(x='d15N_NO2', y='Nitrite')
| base.encode(x='d15N_NO2', y='d18O_NO2')
| base.encode(x='Oxygen', y='NO3_mean')
| base.encode(x='d15N_NO3_mean', y='d18O_NO3_mean')
),
#

).configure_axis(labelFontSize=FONTSIZE,
                 titleFontSize=FONTSIZE).properties(padding=0, spacing=0)
# the padding/spacing doesn't propagate to subcharts properly