<a href="https://colab.research.google.com/github/AvantiShri/oceanography_colab_notebooks/blob/master/for_clkelly/Colette_N2O_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
#for leiden community detection
!pip install leidenalg

Collecting leidenalg
  Downloading https://files.pythonhosted.org/packages/7e/68/01da5910be71e4fd6f96af7c3c0f31f531c96300bbe50b418c0b5a3eaeb6/leidenalg-0.8.1-cp36-cp36m-manylinux2010_x86_64.whl (2.4MB)
Collecting python-igraph>=0.8.0
  Downloading https://files.pythonhosted.org/packages/8b/74/24a1afbf3abaf1d5f393b668192888d04091d1a6d106319661cd4af05406/python_igraph-0.8.2-cp36-cp36m-manylinux2010_x86_64.whl (3.2MB)
Collecting texttable>=1.6.2
  Downloading https://files.pythonhosted.org/packages/ec/b1/8a1c659ce288bf771d5b1c7cae318ada466f73bd0e16df8d86f27a2a3ee7/texttable-1.6.2-py2.py3-none-any.whl
Installing collected packages: texttable, python-igraph, leidenalg
Successfully installed leidenalg-0.8.1 python-igraph-0.8.2 texttable-1.6.2


Grab the data

In [2]:
!wget --no-check-certificate 'https://docs.google.com/uc?export=download&id=1lzNG3-ClWIKWwTska9OPBaUb1o8uPA0h' -O 200413_nitrous_oxide_cycling_regimes_data_for_repositories.csv

--2020-07-30 15:41:13--  https://docs.google.com/uc?export=download&id=1lzNG3-ClWIKWwTska9OPBaUb1o8uPA0h
Resolving docs.google.com (docs.google.com)... 173.194.215.102, 173.194.215.100, 173.194.215.138, ...
Connecting to docs.google.com (docs.google.com)|173.194.215.102|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://doc-08-50-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/9i6n9itac6ukbfd97j87ki3s7gr173f3/1596123600000/00395683668588961264/*/1lzNG3-ClWIKWwTska9OPBaUb1o8uPA0h?e=download [following]
--2020-07-30 15:41:13--  https://doc-08-50-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/9i6n9itac6ukbfd97j87ki3s7gr173f3/1596123600000/00395683668588961264/*/1lzNG3-ClWIKWwTska9OPBaUb1o8uPA0h?e=download
Resolving doc-08-50-docs.googleusercontent.com (doc-08-50-docs.googleusercontent.com)... 173.194.214.132, 2607:f8b0:400c:c0b::84
Connecting to doc-08-50-docs.googleusercontent.com (doc-08

From Colette's email

```
So this whole thing started with a plots of N2O isotopomers (columns "d15N-N2Oa_mean (per mil vs. atm N2)", "d15N-N2Ob_mean 
(per mil vs. atm. N2)", and "d18O-N2O_mean (per mil vs. VSMOW)") vs. the inverse of N2O concentration (1/"N2O_mean (nM)"). They are 
a figure in my paper. These plots had a visible change point in them. Patrick has noticed a similar phenomenon in his data. The two 
clusters on a plot like this indicate two different pools of N2O produced from two different sources.

I strongly suspect that nitrite concentration ("Nitrite [uM]" in the spreadsheet) and oxygen ("Seabird Oxygen [umol/L]") also inform the clustering. 
Also the isotopes of nitrite and nitrate ("d18O-NO3 avg (per mil vs. VSMOW)" and so forth). Furthermore I feel like the degree to which they inform 
clustering actually gives us additional information as well. For example, if nitrite concentration is a strong predictor whether a datapoint falls into one 
cluster or another, that tells me that nitrite is likely a substrate for one of these N2O pools.

If we could define a relationship between [N2O] and d18O-N2O, controlling for d15N-N2Oa, that could be interesting. From the rudimentary version of this 
clustering stuff in my paper, we see that d15N-N2Oa looks like it could be an N2O consumption signal. But d18O-N2O does not — or rather, d18O-N2O is more of 
a net production + consumption signal. In reductive waters, d18O-N2O and d15N-N2Oa are both tightly controlled by N2O consumption and thus are very well 
correlated. In these plots, we are making the assumption that these are NOT reductive waters, so it would be interesting to see if these two factors have 
relationships with [N2O] that are independent of each other.
```
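
The reason for plotting the isotope values against 1/[N2O] can be illustrated with a two-pool mass balance: if a source with a fixed signature is added to a fixed background, the mixture's delta value falls on a straight line in 1/concentration space, and the intercept is the source signature. Two sources would then give two lines, hence a change point. The sketch below is only an illustration of that idea, not part of Colette's analysis. The function name `mixture` and all of the numbers are hypothetical, and the linear mass balance in delta notation is itself an approximation.

```python
def mixture(c_bg, delta_bg, c_src, delta_src):
    """Return (concentration, delta) after adding a source pool to a background pool.

    Uses a concentration-weighted average of the delta values, which is the
    usual linear approximation for isotope mass balance.
    """
    c_mix = c_bg + c_src
    delta_mix = (c_bg * delta_bg + c_src * delta_src) / c_mix
    return c_mix, delta_mix


# Hypothetical background pool and source signature (illustrative numbers only)
C_BG, DELTA_BG = 10.0, 8.0  # nM, per mil
DELTA_SRC = 2.0             # per mil

# Add increasing amounts of the source to the same background
points = [mixture(C_BG, DELTA_BG, added, DELTA_SRC)
          for added in (0.0, 5.0, 10.0, 30.0)]
inv_conc = [1.0 / c for c, _ in points]
deltas = [d for _, d in points]

# Straight line through the first and last point in (1/C, delta) space
slope = (deltas[-1] - deltas[0]) / (inv_conc[-1] - inv_conc[0])
intercept = deltas[0] - slope * inv_conc[0]

# Largest deviation of any point from that line (zero up to rounding error)
max_residual = max(abs(d - (intercept + slope * x))
                   for x, d in zip(inv_conc, deltas))
```

Under these assumptions the intercept recovers `DELTA_SRC` and the slope equals `C_BG * (DELTA_BG - DELTA_SRC)`, which is why a second source with a different signature shows up as a second line with a different intercept.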

Read in the data frame and rename the columns

In [3]:
from matplotlib import pyplot as plt
import numpy as np
import pandas

#Easy remapping of the column names
colnames_map = {'d15N_N2Oa_mean':"d15N-N2Oa_mean (per mil vs. atm N2)",
            'd15N_N2Ob_mean':"d15N-N2Ob_mean (per mil vs. atm. N2)",
            'd18O_N2O_mean':"d18O-N2O_mean (per mil vs. VSMOW)",
            'N2O_mean':"N2O_mean (nM)",
            'd18O_NO3_mean':'d18O-NO3 avg (per mil vs. VSMOW)',
            'd15N_NO3_mean':'d15N-NO3 avg (per mil vs. atm. N2)',
            'd15N_NO2': 'd15N-NO2 (per mil vs. atm N2)',
            'd18O_NO2': 'd18O-NO2 (per mil vs. VSMOW)',
            'Nitrite':"Nitrite [uM]",
            'Oxygen':"Seabird Oxygen [umol/L]",
            'NO3_mean':'NO3_mean (uM)',
            'Depth': 'Target Depth [m]'}

#For some reason, altair chokes when provided data frames with some
# of the original column names. So I am remapping the column names.
def remap_colnames(df, colnames_map):
  foraltair_df = pandas.DataFrame(dict([
      (new_col, np.array(df[orig_col]))
      for new_col,orig_col in colnames_map.items()]))
  return foraltair_df

df = pandas.read_csv("200413_nitrous_oxide_cycling_regimes_data_for_repositories.csv")
filtered_df = remap_colnames(df=df, colnames_map=colnames_map)
#create a column for the inverse of the N2O mean
filtered_df['inv_N2O_mean'] = 1/filtered_df['N2O_mean']

Prepare the features for clustering

In [4]:

#replace nan values with column mean
nanfilled_df = filtered_df.fillna(filtered_df.mean()) 

#for clustering purposes, standardize each column by subtracting the mean and
# dividing by the standard deviation
for colname in colnames_map:
  vals = np.array(nanfilled_df[colname])
  filtered_df['zscore_'+colname] = (vals-np.mean(vals))/np.std(vals)

columns_to_compare = [
  'zscore_'+x for x in [
      'd15N_N2Oa_mean', 'd15N_N2Ob_mean', 'd18O_N2O_mean',
      'N2O_mean', 'd18O_NO3_mean', 'd15N_NO3_mean',
      'd15N_NO2', 'd18O_NO2', 'Nitrite',
      'Oxygen', 'NO3_mean', 'Depth']]

#prepare a 'features' matrix for each point
features = np.array([np.array(filtered_df[col])
                     for col in columns_to_compare]).transpose((1,0))
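
As a sanity check on the preprocessing above, here is a dependency-free sketch of the same two steps (mean imputation, then z-scoring) on a single toy column. The helper names `mean_fill` and `zscore` and the toy values are hypothetical. `statistics.pstdev` is used because `np.std` defaults to the population standard deviation.

```python
import math
import statistics


def mean_fill(values):
    """Replace NaN entries with the mean of the observed (non-NaN) entries."""
    observed = [v for v in values if not math.isnan(v)]
    fill = statistics.mean(observed)
    return [fill if math.isnan(v) else v for v in values]


def zscore(values):
    """Subtract the mean and divide by the population standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]


# Toy column with one missing measurement (illustrative values only)
toy_column = [1.0, float("nan"), 3.0, 5.0]
filled = mean_fill(toy_column)
standardized = zscore(filled)
```

One consequence worth keeping in mind when reading the clusters: a mean-filled entry always ends up with a z-score of exactly zero, so a sample with many missing measurements is pulled toward the center of feature space.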

Run clustering + lower-dimensional visualization

In [5]:
import leidenalg
import scipy.spatial.distance  #explicit import; a bare `import scipy` may not load the submodule
import sklearn.manifold


#From: https://github.com/theislab/scanpy/blob/8131b05b7a8729eae3d3a5e146292f377dd736f7/scanpy/_utils.py#L159
def get_igraph_from_adjacency(adjacency, directed=None):
    """Get igraph graph from adjacency matrix."""
    import igraph as ig
    sources, targets = adjacency.nonzero()
    weights = adjacency[sources, targets]
    if isinstance(weights, np.matrix):
        weights = weights.A1
    g = ig.Graph(directed=directed)
    g.add_vertices(adjacency.shape[0])  # this adds adjacency.shape[0] vertices
    g.add_edges(list(zip(sources, targets)))
    try:
        g.es['weight'] = weights
    except Exception:
        # leave the graph unweighted if the weights cannot be assigned
        pass
    if g.vcount() != adjacency.shape[0]:
        print('WARNING: The constructed graph has only '
              +str(g.vcount())+' nodes. '
             'Your adjacency matrix contained redundant nodes.')
    return g


def run_leiden_community_detection(affinity_matrix):
  the_graph = get_igraph_from_adjacency(affinity_matrix)
  partition = leidenalg.find_partition(
                    the_graph, leidenalg.ModularityVertexPartition,
                    weights=(np.array(the_graph.es['weight'])
                             .astype(np.float64)),
                    n_iterations=-1,
                    seed=1234)
  return partition.membership


def run_leiden_using_nearest_neighbors_affmat(features, n_neighbors):
  nearest_neighbors_affmat = sklearn.manifold.SpectralEmbedding(
    n_components=10,
    n_neighbors=n_neighbors,
    affinity='nearest_neighbors').fit(features).affinity_matrix_
  leiden_clusters = run_leiden_community_detection(nearest_neighbors_affmat)
  return leiden_clusters


def run_leiden_using_tsneadapted_distances(features, perplexity):
  pairwise_distances = scipy.spatial.distance.squareform(
      scipy.spatial.distance.pdist(X=features))
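  #NOTE: _binary_search_perplexity is a private scikit-learn helper, so its
  # location and signature may differ between releases.
  affmat = sklearn.manifold._utils._binary_search_perplexity(
                pairwise_distances.astype("float32"), perplexity, False)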
  #symmetrize affinity matrix by addition
  affmat = affmat + affmat.T
  leiden_clusters = run_leiden_community_detection(affmat)
  return leiden_clusters

#Get Leiden communities using t-sne derived distances
PERPLEXITY = 20
leiden_clusters = run_leiden_using_tsneadapted_distances(
    features=features, perplexity=PERPLEXITY)

#derive t-sne embedding given the features
embedding = sklearn.manifold.TSNE(perplexity=PERPLEXITY,
                                  random_state=1234).fit_transform(features)

#Store the results of the clustering and the embedding in the data frame
filtered_df['tsne_axis1'] = embedding[:,0]
filtered_df['tsne_axis2'] = embedding[:,1]
#I am storing the clusters as strings so they automatically get
# interpreted as categorical
filtered_df['clusters'] = [str(x) for x in leiden_clusters]
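
To make the "t-SNE adapted" affinities less of a black box, here is a rough pure-Python sketch of the idea behind the perplexity search: for each point, a Gaussian kernel width is tuned by bisection until the resulting distribution over the other points has the requested perplexity. This is not scikit-learn's implementation. The names `row_affinities`, `perplexity_of` and `search_beta` are hypothetical, the toy distances are made up, and the sketch handles one row of the distance matrix at a time, with the point's distance to itself left out.

```python
import math


def row_affinities(distances, beta):
    """Gaussian-kernel probabilities over one point's neighbours.

    `beta` plays the role of 1 / (2 * sigma**2): larger beta means a
    narrower kernel and a more concentrated distribution.
    """
    weights = [math.exp(-d * beta) for d in distances]
    total = sum(weights)
    return [w / total for w in weights]


def perplexity_of(probs):
    """Perplexity = exp(Shannon entropy), an 'effective number of neighbours'."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return math.exp(entropy)


def search_beta(distances, target_perplexity, n_steps=100, tol=1e-5):
    """Bisect on beta until the row's perplexity matches the target."""
    lower, upper = 0.0, math.inf
    beta = 1.0
    for _ in range(n_steps):
        diff = perplexity_of(row_affinities(distances, beta)) - target_perplexity
        if abs(diff) < tol:
            break
        if diff > 0:
            # distribution too spread out: narrow the kernel
            lower = beta
            beta = beta * 2.0 if math.isinf(upper) else (beta + upper) / 2.0
        else:
            # distribution too concentrated: widen the kernel
            upper = beta
            beta = (beta + lower) / 2.0
    return beta, row_affinities(distances, beta)


# Toy distances from one point to five others (illustrative values only)
toy_distances = [0.5, 1.0, 2.0, 4.0, 8.0]
beta, probs = search_beta(toy_distances, target_perplexity=2.0)
```

Because each row gets its own kernel width, the resulting matrix is not symmetric, which is why the notebook adds the matrix to its transpose before handing it to Leiden.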

View altair interactive visualizations


In [124]:
import altair as alt

DF_TO_USE = filtered_df
INTERVAL = alt.selection_interval()
TOTAL_WIDTH=1200
TOTAL_HEIGHT=680
TSNE_HEIGHTFRAC=0.4
TSNE_WIDTHFRAC=0.2
FONTSIZE=10
PADDING_GUESS=45 #additional padding to subtract off

def get_interactive_histogram(colname):
  yaxis = alt.Y('count():Q', title="Count")
  xaxis = alt.X(colname+':Q', bin=alt.Bin(maxbins=20))
  #apparently height/width doesn't include the space for the
  # axes labels, so these need to be adjusted a bit.
  bg_histogram = alt.Chart(DF_TO_USE).mark_bar().encode(
                    y=yaxis,
                    x=xaxis,
                    color=alt.value('lightgrey')).properties(
                      width=TOTAL_WIDTH*(1-TSNE_WIDTHFRAC)/4
                            - (FONTSIZE+PADDING_GUESS),
                      height=TOTAL_HEIGHT*TSNE_HEIGHTFRAC/3
                            - (FONTSIZE+PADDING_GUESS),
                      selection=INTERVAL)
  fg_histogram = alt.Chart(DF_TO_USE).mark_bar().encode(
                      y=yaxis,
                      color=alt.value('steelblue'),
                      x=xaxis).transform_filter(INTERVAL)
  return (bg_histogram+fg_histogram)

tsne_base = alt.Chart(DF_TO_USE).mark_point().encode(
  color=alt.condition(INTERVAL, 'clusters', alt.value('lightgray'),
                      scale=alt.Scale(scheme='category10'))
).properties(selection=INTERVAL,
             width=TOTAL_WIDTH*TSNE_WIDTHFRAC - (FONTSIZE+PADDING_GUESS),
             height=TOTAL_HEIGHT*TSNE_HEIGHTFRAC - (FONTSIZE+PADDING_GUESS))

base = alt.Chart(DF_TO_USE).mark_point().encode(
  color=alt.condition(INTERVAL, 'clusters', alt.value('lightgray'),
                      scale=alt.Scale(scheme='category10'))
).properties(selection=INTERVAL,
             width=TOTAL_WIDTH/4 - (FONTSIZE+PADDING_GUESS),
             height=(TOTAL_HEIGHT*(1-TSNE_HEIGHTFRAC))/2 
                     - (FONTSIZE+PADDING_GUESS))

alt.vconcat(
    
(tsne_base.encode(x='tsne_axis1', y='tsne_axis2')
| alt.vconcat(get_interactive_histogram('Depth'),
             get_interactive_histogram('Oxygen'),
             get_interactive_histogram('inv_N2O_mean'))
| alt.vconcat(get_interactive_histogram('d15N_N2Oa_mean'),
              get_interactive_histogram('d15N_N2Ob_mean'),
              get_interactive_histogram('d18O_N2O_mean'))
| alt.vconcat(get_interactive_histogram('NO3_mean'),
              get_interactive_histogram('d15N_NO3_mean'),
              get_interactive_histogram('d18O_NO3_mean'))
| alt.vconcat(get_interactive_histogram('Nitrite'),
              get_interactive_histogram('d15N_NO2'),
              get_interactive_histogram('d18O_NO2'))
),

(base.encode(x='inv_N2O_mean', y='d15N_N2Oa_mean')
| base.encode(x='inv_N2O_mean', y='d15N_N2Ob_mean')
| base.encode(x='inv_N2O_mean', y='d18O_N2O_mean')
| base.encode(x='d15N_N2Oa_mean', y='d18O_N2O_mean')
),

(base.encode(x='d15N_NO2', y='Nitrite')
| base.encode(x='d15N_NO2', y='d18O_NO2')
| base.encode(x='Oxygen', y='NO3_mean')
| base.encode(x='d15N_NO3_mean', y='d18O_NO3_mean')
),
#

).configure_axis(labelFontSize=FONTSIZE,
                 titleFontSize=FONTSIZE).properties(padding=0, spacing=0)
# the padding/spacing doesn't propagate to subcharts properly
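
The width and height expressions in the cell above are easier to follow as a pixel budget: each panel is allotted an outer slot, and `FONTSIZE + PADDING_GUESS` is subtracted from each side length to leave room for axis labels. The sketch below just redoes that arithmetic with the constants from the cell. The names `margin`, `*_slot` and `plot_area` are hypothetical and do not appear in the notebook.

```python
# Constants copied from the visualization cell
TOTAL_WIDTH = 1200
TOTAL_HEIGHT = 680
TSNE_HEIGHTFRAC = 0.4
TSNE_WIDTHFRAC = 0.2
FONTSIZE = 10
PADDING_GUESS = 45

# Pixels reserved per panel, in each direction, for axis labels
margin = FONTSIZE + PADDING_GUESS

# Outer (width, height) slot for each kind of panel, before the margin is removed
tsne_slot = (TOTAL_WIDTH * TSNE_WIDTHFRAC,
             TOTAL_HEIGHT * TSNE_HEIGHTFRAC)
hist_slot = (TOTAL_WIDTH * (1 - TSNE_WIDTHFRAC) / 4,
             TOTAL_HEIGHT * TSNE_HEIGHTFRAC / 3)
scatter_slot = (TOTAL_WIDTH / 4,
                TOTAL_HEIGHT * (1 - TSNE_HEIGHTFRAC) / 2)

# Top row: one t-SNE panel plus four histogram columns; then two scatter rows
top_row_width = tsne_slot[0] + 4 * hist_slot[0]
scatter_row_width = 4 * scatter_slot[0]
stacked_height = tsne_slot[1] + 2 * scatter_slot[1]


def plot_area(slot):
    """Width and height actually passed to .properties() for a panel."""
    return tuple(side - margin for side in slot)
```

With these constants the slots tile the full 1200 by 680 budget, and each histogram's plot area comes out at roughly 185 by 36 pixels, which is why the histograms look so flat.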