# Exercise: Climate Networks of the Indian Monsoon

In this exercise we explore rainfall patterns in India during the monsoon season (June-July-August-September; JJAS).  
We proceed as in the tutorial:  
1. Load and preprocess the data
2. Compute pairwise similarities between all time series
3. Generate the adjacency matrix
4. Generate the network and analyze its communities


In [None]:
# import required packages
import xarray as xr
import numpy as np
import matplotlib.pyplot as plt
import plot_utils as put
import scipy.stats as st
import networkx as nx  # For network analysis
import networkit as nk  # For community detection
from importlib import reload

## Get Familiar with the data
The data is loaded using the package xarray.  

**Exercise:** The data is provided at daily resolution. However, daily precipitation data is highly stochastic.  
We therefore analyze weekly data to average out the day-to-day variations.



In [None]:
# Load the data to an xarray dataset
ds = xr.open_dataset('./data/mswep_pr_1_india_jjas_ds.nc')
# Resample the dataset to weekly values
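
As a rough illustration of the resampling step, the sketch below averages a synthetic daily dataset into weekly bins with `resample`. The toy grid, the variable names (`toy`, `toy_weekly`) and the random values are assumptions made up for this example; only the `pr` variable name and the `time`/`lat`/`lon` dimensions mirror the notebook.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic daily "precipitation" on a 2x3 grid for one JJAS season (hypothetical data)
time = pd.date_range("2000-06-01", "2000-09-30", freq="D")
rng = np.random.default_rng(0)
toy = xr.Dataset(
    {"pr": (("time", "lat", "lon"), rng.gamma(2.0, 3.0, size=(time.size, 2, 3)))},
    coords={"time": time, "lat": [10.0, 11.0], "lon": [75.0, 76.0, 77.0]},
)

# Average into weekly bins and drop weeks that contain no data at all
toy_weekly = toy.resample(time="1W").mean().dropna(dim="time", how="all")
print(toy_weekly["pr"].shape)
```

With a multi-year JJAS dataset, resampling creates empty weekly bins between the seasons, which is why the sketch drops all-NaN time steps afterwards.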



**Exercise:** Get familiar with the seasonal mean and the quantiles:  
Plot the mean precipitation over the Indian JJAS monsoon season as well as the 0.9 quantile, using cartopy.

In [None]:
reload(put)
var_name = 'pr'
mean_pr = None # Compute the mean and plot
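
The two statistics can be computed along the time dimension before plotting. The following is a minimal sketch on a synthetic `DataArray` (the names `toy_pr`, `toy_mean` and `toy_q90` are made up for illustration); the cartopy plotting itself is left out.

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(1)
# Hypothetical precipitation field with dimensions (time, lat, lon)
toy_pr = xr.DataArray(
    rng.gamma(2.0, 3.0, size=(200, 2, 3)),
    dims=("time", "lat", "lon"),
    name="pr",
)

toy_mean = toy_pr.mean(dim="time")           # temporal mean per grid cell
toy_q90 = toy_pr.quantile(0.9, dim="time")   # 0.9 quantile per grid cell
print(toy_mean.shape, toy_q90.shape)
```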

## Preprocess Data

### Compute anomaly time series
**Exercise**  Plot the time series of the average precipitation over India. 

In [None]:
# Plot the average JJAS rainfall
# Are there any problems?
# Use ds[var_name]


**Exercise:**  Next, compute the day-of-year anomalies. Do you think we have to detrend the data? Why/Why not?

In [None]:
# Compute anomaly time series 
# Group each time point by its corresponding day of the year
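
One possible way to obtain day-of-year anomalies is xarray's `groupby` arithmetic: subtract the mean of each day-of-year group from every member of that group. The sketch below uses a synthetic single-cell series; `toy_da`, `climatology` and `toy_anomalies` are hypothetical names.

```python
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range("2000-01-01", "2002-12-31", freq="D")
doy = time.dayofyear.to_numpy()
rng = np.random.default_rng(2)
# Seasonal cycle plus noise at a single hypothetical grid cell
values = 5.0 + 3.0 * np.sin(2 * np.pi * doy / 365.25) + rng.normal(0, 1, time.size)
toy_da = xr.DataArray(values, dims="time", coords={"time": time}, name="pr")

# Mean over all years for each day of the year
climatology = toy_da.groupby("time.dayofyear").mean("time")
# Subtract the matching climatological value from every time step
toy_anomalies = toy_da.groupby("time.dayofyear") - climatology
```

By construction, the anomalies of each day-of-year group average to zero.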



In [None]:
# Compute the trends and plot for particular cells
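
A linear trend for a single grid cell can be estimated with `scipy.stats.linregress`. The series below is pure synthetic noise (an assumption for this sketch, not the MSWEP data), so the fitted slope should be close to zero.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(3)
years = np.arange(1980, 2020)
# Hypothetical seasonal-mean anomalies of one grid cell: noise only, no trend built in
cell_series = rng.normal(0.0, 1.0, size=years.size)

fit = st.linregress(years, cell_series)
trend_line = fit.intercept + fit.slope * years   # fitted line, e.g. for plotting
print(f"slope = {fit.slope:.4f} per year, p-value = {fit.pvalue:.3f}")
```

Comparing the slope (and its p-value) with the typical spread of the series gives a first impression of whether detrending is worthwhile.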


The linear fit shows that the slope of the increase/decrease is very small.  
We can therefore conclude that there is no clear trend in the precipitation data over the last 40 years,  
so we do not need to detrend the data.  


## Adjacency

First, the data is reshaped into an array of time series, one per grid cell.

In [None]:
da = ds['anomalies']  # Use the anomaly data to compute the pairwise correlations
print('Dataset shape: ', da.shape)
dim_time, dim_lat, dim_lon = da.shape
# Bring all into a form of an array of time series
data = []
for idx, t in enumerate(da.time):
    buff = da.sel(time=t.data).data.flatten()  # flatten each time step
    buff[np.isnan(buff)] = 0.0  # set missing anomalies to 0, i.e. to climatology
    data.append(buff)
data = np.array(data)


**Exercise:** Compute all pairwise correlations using Spearman's rank-order correlation.  

*Hint: Pay attention to exclude all non-significant correlation values! Take a confidence level of 99.9%.*

**Exercise:** Compute the minimum value of the correlation that still counts as significant.  
What do you think? Is this a good threshold value? Compute the adjacency matrix for different thresholds.  
What do you think is a good density for the adjacency matrix?

In [None]:
print('Flattened Dataset shape: ', data.shape)
corr, pvalue =  None # .... 
print('Shape of correlation Matrix: ', corr.shape)
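
For an array of shape (time, nodes), `scipy.stats.spearmanr` treats each column as one variable and returns a full correlation matrix together with a matrix of p-values. The sketch below runs it on a small synthetic array; `toy_data`, `toy_corr` and `toy_pvalue` are names made up for this example.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(4)
n_time, n_nodes = 100, 5
toy_data = rng.normal(size=(n_time, n_nodes))
# Make nodes 0 and 1 co-vary strongly
toy_data[:, 1] = toy_data[:, 0] + 0.1 * rng.normal(size=n_time)

# Columns are the variables (axis=0), so both results have shape (n_nodes, n_nodes)
toy_corr, toy_pvalue = st.spearmanr(toy_data, axis=0)
print(toy_corr.shape, toy_pvalue.shape)
```

Note that the result grows quadratically with the number of grid cells, which can matter for memory on fine grids.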


Not all correlations are statistically significant.  
Let's first exclude the non-significant ones.

In [None]:
confidence = 0.999
# Exclude non-significant values
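
One way to drop non-significant values is to mask the correlation matrix with the p-value matrix. The sketch below does this on synthetic data and also reads off the smallest absolute correlation that still passes the test; all `toy_*` names, `corr_sig` and `min_sig_corr` are hypothetical.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(5)
toy_data = rng.normal(size=(60, 6))
toy_data[:, 1] += toy_data[:, 0]           # build in one strong dependence
toy_corr, toy_pvalue = st.spearmanr(toy_data, axis=0)

confidence = 0.999
significant = toy_pvalue < (1 - confidence)
np.fill_diagonal(significant, False)       # ignore self-correlations
corr_sig = np.where(significant, toy_corr, 0.0)

# Smallest absolute correlation that is still significant (NaN if there is none)
min_sig_corr = np.abs(toy_corr[significant]).min() if significant.any() else np.nan
print(min_sig_corr)
```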

Finally, compute the adjacency matrix of the network. 
Think about how you would choose the correlation threshold.  
What might be a problem with thresholds that are too high or too low?

In [None]:
threshold = None # Set a threshold, can be 0
# compute adjacency
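
The link between threshold and density can be explored with two small helpers. This is only a sketch: the helper names `adjacency_from_corr` and `density` are made up, the input matrix is random, and thresholding the absolute value is an assumption (use `corr > threshold` if only positive correlations should form links).

```python
import numpy as np

rng = np.random.default_rng(6)
n = 50
# Hypothetical symmetric "correlation" matrix with entries in [-1, 1]
m = rng.uniform(-1, 1, size=(n, n))
toy_corr = (m + m.T) / 2
np.fill_diagonal(toy_corr, 1.0)

def adjacency_from_corr(corr, threshold):
    """Binary adjacency: link where |corr| exceeds the threshold, no self-loops."""
    adj = (np.abs(corr) > threshold).astype(int)
    np.fill_diagonal(adj, 0)
    return adj

def density(adj):
    """Fraction of realised links among all n*(n-1) ordered node pairs."""
    n_nodes = adj.shape[0]
    return adj.sum() / (n_nodes * (n_nodes - 1))

for thr in (0.2, 0.5, 0.8):
    print(thr, round(density(adjacency_from_corr(toy_corr, thr)), 3))
```

Raising the threshold can only remove links, so the density decreases monotonically with the threshold.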


An ideal density of the network should be around 5-10%. Setting the threshold to different 
values changes the density accordingly.  
Once we have the adjacency matrix, we can create a networkx object from it.  

### Analyze the network

First, the network is transformed to a networkx object. For this, the adjacency has to be a numpy array of shape ($lon \times lat$, $lon \times lat$).

In [None]:
# Use networkx to work more conveniently with the adjacency matrix
import networkx as nx
cnx = nx.DiGraph(adjacency)

# Set the longitude and latitude as node attributes
lons = ds.lon
lats = ds.lat
lon_mesh, lat_mesh = np.meshgrid(lons, lats)  # This gives us a list of longitudes and latitudes per node
nx.set_node_attributes(cnx, {node: lon_mesh.flatten()[node] for node in cnx.nodes()}, 'lon')
nx.set_node_attributes(cnx, {node: lat_mesh.flatten()[node] for node in cnx.nodes()}, 'lat')


Now we take the first steps to analyze the network.   
**Exercise:** Compute the node degree. The degree $k_i$ of a node $i$ is computed from the adjacency matrix $A$:  
$$ k_i = \sum_j A_{ij} $$ 


In [None]:
# Compute the node degree and plot it
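
With a numpy adjacency matrix, the node degree is a row sum, and reshaping it to the grid gives a map that can be plotted. The sketch below uses a hand-written 6-node adjacency on a hypothetical 2x3 grid; `toy_adj`, `degree` and `degree_map` are illustrative names.

```python
import numpy as np

n_lat, n_lon = 2, 3
# Hypothetical symmetric adjacency for the 6 grid cells (flattened row-major: lat, then lon)
toy_adj = np.array([
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 0],
    [0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
])

degree = toy_adj.sum(axis=1)                # k_i = sum_j A_ij
degree_map = degree.reshape(n_lat, n_lon)   # back onto the lat-lon grid
print(degree_map)
```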


**Exercise:** Compute the Betweenness Centrality
$$
BC_v(v_i) = \sum_{s,t}^N \frac{\sigma(v_s, v_t|v_i)}{\sigma(v_s, v_t)} \; ,  
$$
where $\sigma (v_s,v_t)$ denotes the number of shortest paths between nodes $v_s$ and $v_t$ and $\sigma(v_s,v_t | v_i) \leq \sigma(v_s,v_t)$ the number of all shortest paths that include node $v_i$.  

*Hint: Look at the [documentation](https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.centrality.betweenness_centrality.html#networkx.algorithms.centrality.betweenness_centrality)*

You can also try out other network measures.  


In [None]:
# Compute the Betweenness centrality and plot it
# Use the Betweenness centrality function from networkx
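
To get a feeling for the measure, the sketch below computes the betweenness centrality of a small path graph, where the middle node lies on the most shortest paths. The graph `toy_graph` is a made-up example, not the climate network.

```python
import networkx as nx

# Path graph 0-1-2-3-4: every shortest path between the two halves passes node 2
toy_graph = nx.path_graph(5)

# normalized=False returns raw shortest-path counts instead of rescaled values
bc = nx.betweenness_centrality(toy_graph, normalized=False)
print(bc)
```

For large networks the exact computation can be slow; the `k` argument of `nx.betweenness_centrality` estimates the measure from a sample of `k` source nodes.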


In [None]:
reload(put)

# Plot BC


**Exercise:**  Compare your results with [Stolbova et al., 2014](https://npg.copernicus.org/articles/21/901/2014/). Do you find similarities/differences?  
Note that the current literature on precipitation analysis often uses a similarity measure other than Spearman's correlation!  
Can you provide an explanation for the concentration of links to the western coast of India?

### Visualize single edges of the network
**Exercise:**  To better analyze single parts of the network we want to extract the links from multiple specific regions.
For the precipitation network, do you spot any particular differences to the global 2m-air temperature networks?  

*Hint: As an example try different locations at the coast, at mountain areas, at high/low latitudes etc.*

Do this in 3 consecutive steps:
1. Find the ids of the source nodes in the region whose outgoing links you want to analyze
2. Find all edges that start in this region, using the adjacency or the networkx package (their end points are called target nodes)
3. Find the spatial locations of the target nodes.

In [None]:
# Find out the nodes of the source region
lat_range = [20, 25]
lon_range = [75,78]
# Why is this mask needed?
mean_ds = ds[var_name].mean(dim='time')
mask = (
        (mean_ds['lat'] >= min(lat_range))
        & (mean_ds['lat'] <= max(lat_range))
        & (mean_ds['lon'] >= min(lon_range))
        & (mean_ds['lon'] <= max(lon_range))
        )
source_map = None  # Fill this out

# Plot source Ids here for control

# Get Ids of locations
source_ids = np.where(source_map.data.flatten()==1)[0]  # flatten data and get position in array

In [None]:
# Find target Ids in the network
edge_list = []
for sid in source_ids:
        edge_list.append(list(cnx.edges(sid)))

edge_list = np.concatenate(edge_list, axis=0)  # transform to 2d np array
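
For the third step, flattened node ids can be mapped back to grid positions with `np.unravel_index`, assuming the nodes were flattened in row-major (lat, lon) order as in the `np.meshgrid(lons, lats)` call above. The coordinates and the edge list in this sketch are hypothetical.

```python
import numpy as np

lats = np.array([10.0, 11.0])
lons = np.array([75.0, 76.0, 77.0])
# Hypothetical (source, target) pairs given as flattened node indices
toy_edges = np.array([[0, 4], [0, 5], [1, 3]])

target_ids = np.unique(toy_edges[:, 1])
# Convert flat indices into (lat index, lon index) on the 2x3 grid
lat_idx, lon_idx = np.unravel_index(target_ids, (lats.size, lons.size))
target_lats, target_lons = lats[lat_idx], lons[lon_idx]
print(list(zip(target_lats, target_lons)))
```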

In [None]:
# Plot edges here

## Community detection in climate Networks
Now we want to examine the overall structure of the network.  
To do so, we identify communities in the network. There are many algorithms to detect communities in graphs.  

**Exercise:** Use the standard [Louvain algorithm](https://en.wikipedia.org/wiki/Louvain_method) from the [NetworKit](https://networkit.github.io/dev-docs/notebooks/Community.html) package to identify communities in the climate network. 

*Hint: Run this algorithm multiple times. Do you notice anything? Where do the differences come from? For this read the documentation of the implementations.*  

What might be a solution for this problem? 

In [None]:
# The nk algorithm needs the nx network to be converted to an nk object
cnk = nk.nxadapter.nx2nk(cnx.to_undirected())
# Use the Parallel Louvain Method (PLM) of NetworKit
nkCommunities = None # Fill this out
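
To illustrate the run-to-run variability mentioned in the hint, the sketch below repeats a Louvain run with different random seeds. It deliberately swaps in networkx's `louvain_communities` for NetworKit's PLM so that the seed can be set explicitly; the toy graph and the names `partitions` and `n_communities` are assumptions for this example.

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities

# Two 5-node cliques joined by a single edge: a clear two-community structure
toy_graph = nx.barbell_graph(5, 0)

# One partition (a list of node sets) per random seed
partitions = [louvain_communities(toy_graph, seed=s) for s in range(5)]
n_communities = [len(p) for p in partitions]
print(n_communities)
```

On such a clean toy graph the runs are expected to agree; on a real climate network, comparing the partitions across seeds shows which community boundaries are stable.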

In [None]:
# Plot Communities here


**Exercise:**  Can you explain the different communities? Try to compare the communities with different orographic zones and relate them to the overall climate.

## Clustering of climate data

**Exercise:** Compute clusters by complete linkage clustering of the Spearman's correlation matrix!  
You might follow the method from [Rheinwalt et al. 2015](https://link.springer.com/chapter/10.1007/978-3-319-17220-0_3); moreover, our results can be compared to [Malik et al., 2010](https://www.nonlin-processes-geophys.net/17/371/2010/).  
You can use the functions below or try out another clustering algorithm!

In [None]:
def get_distance(corr, pvalue, confidence=0.999, threshold=None):
    """Get distance matrix and distance threshold for a given confidence level.

    Note: absolute correlation values are used, so the sign of a correlation
    is ignored.

    Return:
    -----
    distance: np.ndarray (n, n)
        Distance matrix
    threshold_dist: float
        Distance threshold where the clustering is stopped
    """
    # use absolute correlations
    corr_abs = np.abs(corr)

    # get distance matrix; clip to [0, 1] to guard against rounding errors in arccos
    distance = np.arccos(np.clip(corr_abs, 0.0, 1.0))

    # consider only correlations with corresponding pvalues smaller than (1-confidence)
    corr_sig = np.where(pvalue <= (1 - confidence), corr_abs, np.nan)  # p-value test

    # get threshold: weakest significant correlation (at or above `threshold`, if given)
    if threshold is not None:
        corr_sig = np.where(corr_sig >= threshold, corr_sig, np.nan)
    idx_min = np.unravel_index(np.nanargmin(corr_sig), corr_sig.shape)
    threshold_corr = corr_abs[idx_min]
    threshold_dist = distance[idx_min]

    print(f"p-value {pvalue[idx_min]}, \n",
          f"correlation {threshold_corr} \n",
          f"Min distance threshold {threshold_dist}")

    return distance, threshold_dist

def complete_linkage_cluster(distance, threshold=None, linkage="complete", n_clusters=None):
    """Complete linkage clustering.

    Return:
    -------
    labels: np.ndarray (n)
        Cluster label of each datapoint
    model: sklearn.cluster.AgglomerativeClustering
        Complete linkage clustering model
    """
    # Use scikit-learn's agglomerative clustering on the precomputed distances
    from sklearn.cluster import AgglomerativeClustering
    if n_clusters is not None:
        # Exactly one of n_clusters and distance_threshold has to be set,
        # and the other needs to be None. Here we set n_clusters if given!
        threshold = None

    # create hierarchical cluster
    model = AgglomerativeClustering(
        distance_threshold=threshold,
        n_clusters=n_clusters,
        compute_full_tree=True,
        metric='precomputed',  # named `affinity` in scikit-learn < 1.2
        connectivity=None,
        linkage=linkage
    )
    labels = model.fit_predict(distance)
    print(
        f"Found {np.max(labels)+1} clusters for the given threshold {threshold}.")
    return labels, model


In [None]:
# Compute Clusters here
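
As one alternative to the helper functions above, complete linkage clustering can also be done with SciPy's hierarchy module. The sketch below swaps in `scipy.cluster.hierarchy` for scikit-learn and uses a hand-written 4x4 correlation matrix with the same arccos distance as `get_distance`; the cut at a correlation of 0.5 is an arbitrary assumption.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Hypothetical correlation matrix with two obvious groups: {0, 1} and {2, 3}
toy_corr = np.array([
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
])
toy_dist = np.arccos(np.clip(np.abs(toy_corr), 0.0, 1.0))
np.fill_diagonal(toy_dist, 0.0)

# linkage expects the condensed (upper-triangle) form of the distance matrix
tree = linkage(squareform(toy_dist, checks=False), method="complete")
# Cut the tree at the distance that corresponds to a correlation of 0.5
toy_labels = fcluster(tree, t=np.arccos(0.5), criterion="distance")
print(toy_labels)
```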


## Comparison of Climate Networks to PCA

Climate networks represent a non-linear transformation of the data that reduces its dimensionality. PCA is a linear transformation that is also used for dimensionality reduction. We can compare the principal components to the network measures of the climate network.

**Exercise:**  Apply a PCA to the precipitation anomaly data and visualize the EOF maps of the first two components. What do you see when comparing them to the node degree plots of the climate network? Do you have an explanation for this similarity?

*Hint: You might have a look at [Donges et al., 2015](https://link.springer.com/article/10.1007/s00382-015-2479-3)!*

In [None]:
from sklearn.decomposition import PCA
# Compute PCA
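
The mechanics of a PCA on a (time, nodes) array can be sketched with a plain NumPy SVD, swapped in here for `sklearn.decomposition.PCA`: the rows of `vt` play the role of sklearn's `components_` (up to sign) and can be reshaped into EOF maps. The synthetic anomalies and all variable names below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
n_time, n_lat, n_lon = 200, 2, 3
# Hypothetical anomalies: one shared spatial pattern plus weak noise
pattern = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
amplitude = rng.normal(size=(n_time, 1))
toy_anom = amplitude * pattern + 0.1 * rng.normal(size=(n_time, n_lat * n_lon))

# PCA = SVD of the column-centered data matrix
centered = toy_anom - toy_anom.mean(axis=0)
u, s, vt = np.linalg.svd(centered, full_matrices=False)

explained = s**2 / np.sum(s**2)           # explained variance ratio per component
eof_maps = vt.reshape(-1, n_lat, n_lon)   # each EOF back on the lat-lon grid
pcs = u * s                               # principal component time series
print(np.round(explained[:2], 3))
```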

In [None]:
# Plot EOF maps
i = 0
eof_map = put.create_map_for_da(da=ds[var_name],
                                data=None,  # Fill in the data here
                                name=f'EOF{i}')


# NOTE: `im` is not defined above; assign it from your plotting call for eof_map first
im['ax'].set_title(f"EOF {i+1}")


In [None]:
# Plot node degree