# Data input formats

## Pre-requirements

### Import dependencies

In [1]:
import sys

import matplotlib as mpl
import numpy as np

import cnnclustering.cnn as cnn  # CNN clustering

In [2]:
# Version information
print(sys.version)

3.8.3 (default, May 15 2020, 15:24:35) 
[GCC 8.3.0]


### Notebook configuration

In [3]:
# Matplotlib configuration
mpl.rc_file(
    "matplotlibrc",
    use_default_template=False
)

In [3]:
# Axis property defaults for the plots
ax_props = {
    "xlabel": None,
    "ylabel": None,
    "xlim": (-2.5, 2.5),
    "ylim": (-2.5, 2.5),
    "xticks": (),
    "yticks": (),
    "aspect": "equal"
}

# Line plot property defaults
line_props = {
    "linewidth": 0,
    "marker": '.',
}

## Overview

A data set of $n$ points can primarily be represented through point coordinates in a $d$-dimensional space, or in terms of a pairwise distance matrix (of arbitrary metric). Secondarily, the data set can be described by neighbourhoods (in a graph structure) with respect to a specific radius cutoff. Furthermore, it is possible to trim the neighbourhoods into a density graph that contains, for each point, the density-connected points rather than all neighbours. The memory demand of the input forms and the speed at which they can be clustered vary. Currently, the `cnnclustering.cnn` module can deal with the following data structures ($n$: number of points, $d$: number of dimensions).

__Points__
  
  - 2D NumPy array of shape (*n*, *d*), holding point coordinates

__Distances__

  - 2D NumPy array of shape (*n*, *n*), holding pairwise distances
  
__Neighbourhoods__

  - 1D NumPy array of shape (*n*,) of 1D NumPy arrays of shape (<= *n*,), holding point indices
  - Python list of length (*n*) of Python sets of length (<= *n*), holding point indices
  - Sparse graph with 1D NumPy array of shape (<= *n²*), holding point indices, and 1D NumPy array of shape (*n*,), holding neighbourhood start indices
  
__Density graph__

  - 1D NumPy array of shape (*n*,) of 1D NumPy arrays of shape (<= *n*,), holding point indices
  - Python list of length (*n*) of Python sets of length (<= *n*), holding point indices
  - Sparse graph with 1D NumPy array of shape (<= *n²*), holding point indices, and 1D NumPy array of shape (*n*,), holding connectivity start indices
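
As a rough illustration of the three neighbourhood layouts listed above, the following sketch builds each of them from a small distance matrix using plain NumPy. It does not use `cnnclustering` itself. The names `radius`, `indices` and `starts`, and the choice to exclude each point from its own neighbourhood, are assumptions made for this example only.

```python
import numpy as np

# Four points on a line; pairwise distances via broadcasting
coords = np.array([0.0, 1.0, 2.0, 5.0])
dist = np.abs(coords[:, None] - coords[None, :])
radius = 1.5  # hypothetical radius cutoff

# Boolean neighbour mask; a point is not counted as its own neighbour
# (an assumption of this sketch)
mask = (dist <= radius) & ~np.eye(len(coords), dtype=bool)

# 1) Python list of n sets of point indices
as_sets = [set(np.flatnonzero(row).tolist()) for row in mask]

# 2) 1D object array holding n index arrays of varying length
as_arrays = np.empty(len(coords), dtype=object)
for i, row in enumerate(mask):
    as_arrays[i] = np.flatnonzero(row)

# 3) Sparse layout: all neighbour indices concatenated row by row,
#    plus the position at which each point's neighbourhood starts
indices = np.nonzero(mask)[1]
lengths = mask.sum(axis=1)
starts = np.concatenate(([0], np.cumsum(lengths)[:-1]))

print(as_sets)
print(indices, starts)
```

In the sparse layout, the neighbours of point `i` are `indices[starts[i]:starts[i] + lengths[i]]`, so an isolated point simply yields an empty slice.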

The different input structures are wrapped by corresponding classes to be handled as attributes of a `CNN` cluster object. Different kinds of input formats corresponding to the same data set are bundled in a `Data` object.

## Points

### 2D NumPy array of shape (*n*, *d*)

The `cnn` module provides the class `Points` to handle data set point coordinates. Instances of type `Points` behave essentially like NumPy arrays.

In [19]:
points = cnn.Points()
print("Representation of points: ", repr(points))
print("Points are Numpy arrays:  ", isinstance(points, np.ndarray))

Representation of points:  Points([], dtype=float64)
Points are Numpy arrays:   True


If you already have your data points in the format of a 2D NumPy array, the conversion into `Points` is straightforward and does not require any copying. Note that the dtype of `Points` is currently fixed to `np.float_` (an alias of `np.float64`).

In [42]:
original_points = np.array([[0, 0, 0],
                            [1, 1, 1]], dtype=np.float_)
points = cnn.Points(original_points)
points[0, 0] = 1
points

Points([[1., 0., 0.],
        [1., 1., 1.]])

In [43]:
original_points

array([[1., 0., 0.],
       [1., 1., 1.]])
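
The shared memory seen above follows general NumPy behaviour, which can be checked independently of `cnnclustering`. The sketch below uses a hypothetical ndarray subclass `ArrayLike` as a stand-in; whether `Points` is implemented in exactly this way is an assumption.

```python
import numpy as np


class ArrayLike(np.ndarray):
    """Hypothetical ndarray subclass, standing in for a class like Points."""


original = np.array([[0, 0, 0],
                     [1, 1, 1]], dtype=np.float64)

# dtype already matches: asarray returns the same buffer and view() only
# changes the Python type, so no data is copied
view = np.asarray(original, dtype=np.float64).view(ArrayLike)
view[0, 0] = 1  # also visible through `original`
print(np.shares_memory(original, view))

# dtype differs: the conversion has to allocate a new float array
ints = np.array([[0, 0, 0],
                 [1, 1, 1]])
converted = np.asarray(ints, dtype=np.float64).view(ArrayLike)
print(np.shares_memory(ints, converted))
```

The first check prints `True` and the second `False`: only input that already is a float array can be wrapped without a copy.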

1D sequences are interpreted as a single point on initialisation.

In [45]:
points = cnn.Points(np.array([0, 0, 0]))
points

Points([[0., 0., 0.]])

Other sequences, such as lists, work as input too, but note that this requires a copy.

In [47]:
original_points = [[0, 0, 0],
                   [1, 1, 1]]
points = cnn.Points(original_points)
points

Points([[0., 0., 0.],
        [1., 1., 1.]])

`Points` can be used to represent data sets distributed over multiple parts. Parts could constitute independent measurements that should be clustered together but remain separated for later analyses. Internally, `Points` always stores the underlying point coordinates as a (vertically stacked) 2D array. `Points.edges` is used to track the number of points belonging to each part. The alternative constructor `Points.from_parts` can be used to deduce `edges` from parts of points passed as a sequence of 2D sequences.

In [64]:
points = cnn.Points.from_parts([[[0, 0, 0],
                                 [1, 1, 1]],
                                [[2, 2, 2],
                                 [3, 3, 3]]])
points

Points([[0., 0., 0.],
        [1., 1., 1.],
        [2., 2., 2.],
        [3., 3., 3.]])

In [65]:
points.edges  # 2 parts, 2 points each

array([2, 2])

Trying to set `edges` manually to a sequence that is inconsistent with the total number of points will raise an error. Setting the `edges` of an empty `Points` object is, however, allowed and can be used to store part information even when no points are loaded.

In [66]:
points.edges = [2, 3]

ValueError: Part edges (5 points) do not match data points (4 points)

`Points.by_parts` can be used to retrieve the parts again one by one. 

In [70]:
for part in points.by_parts():
    print(f"{part} \n")

[[0. 0. 0.]
 [1. 1. 1.]] 

[[2. 2. 2.]
 [3. 3. 3.]] 
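
The bookkeeping behind parts can be sketched with plain NumPy: stack the parts, record their sizes, and split again at the cumulative sizes. This is a minimal illustration of the idea rather than the library's implementation. `check_edges` is a hypothetical helper that mimics the consistency check shown above.

```python
import numpy as np

parts = [
    [[0, 0, 0], [1, 1, 1]],
    [[2, 2, 2], [3, 3, 3]],
]

# Stack all parts vertically and remember how many points each part holds
stacked = np.vstack(parts).astype(np.float64)
edges = np.array([len(part) for part in parts])


def check_edges(edges, n_points):
    """Hypothetical helper: reject edges that do not add up to n_points."""
    total = int(sum(edges))
    if total != n_points:
        raise ValueError(
            f"Part edges ({total} points) do not match "
            f"data points ({n_points} points)"
        )


check_edges(edges, len(stacked))

# Split positions are the cumulative part sizes, without the final total
recovered = np.split(stacked, np.cumsum(edges)[:-1])
for part in recovered:
    print(part, "\n")
```

Storing only the part sizes keeps the coordinates in one contiguous array for clustering, while the original parts stay recoverable.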



To provide one possible way to calculate neighbourhoods from points, `Points` has a thin method wrapper for `scipy.spatial.cKDTree`. This will set `Points.tree`, which is used by `CNN.calc_neighbours_from_cKDTree`. The user is encouraged to use any other external method instead.

In [75]:
points.cKDTree()
points.tree

<scipy.spatial.ckdtree.cKDTree at 0x7f0f6d3f3900>
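
For orientation, this is roughly how neighbourhoods can be obtained from a `scipy.spatial.cKDTree` directly, without going through `cnnclustering`. The radius value and the removal of each point from its own neighbourhood are assumptions of this sketch. What `CNN.calc_neighbours_from_cKDTree` does internally is not shown here.

```python
import numpy as np
from scipy.spatial import cKDTree

coords = np.array([[0.0, 0.0],
                   [1.0, 0.0],
                   [0.0, 1.0],
                   [5.0, 5.0]])
tree = cKDTree(coords)
radius = 1.1  # hypothetical radius cutoff

# One list of indices per query point; because the query points are the
# tree's own points, every point is also returned as its own neighbour
raw = tree.query_ball_point(coords, r=radius)

# Drop the self-index (an assumption of this sketch) and store as sets
neighbourhoods = [set(found) - {i} for i, found in enumerate(raw)]
print(neighbourhoods)
```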

## Distances

### 2D NumPy array of shape (*n*, *n*)

The `cnn` module provides the class `Distances` to handle data set pairwise distances as a dense matrix. Like `Points`, instances of type `Distances` behave much like NumPy arrays.

In [79]:
distances = cnn.Distances([[0, 1], [1, 0]])
distances

Distances([[0., 1.],
           [1., 0.]])

`Distances` do not support an `edges` attribute, i.e. they cannot represent part information. Use the `edges` of an associated `Points` instance instead.

Pairwise `Distances` can be calculated for $n$ points within a data set from a `Points` instance, for example with `CNN.calc_dist`, resulting in a matrix of shape ($n$, $n$). They can also be calculated between $n$ points in one and $m$ points in another data set, resulting in a relative distance matrix (map matrix) of shape ($n$, $m$). In the latter case, `Distances.reference` should be used to keep track of the `CNN` object carrying the second data set. Such a map matrix can be used to predict cluster labels for a data set based on the fitted cluster labels of another set.
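
To make the two matrix shapes concrete, the following sketch computes both kinds of matrix with NumPy broadcasting. `pairwise_distances` is a hypothetical helper written for this example, and the Euclidean metric is an assumption; it is not a substitute for `CNN.calc_dist`.

```python
import numpy as np


def pairwise_distances(a, b):
    """Euclidean distances between rows of a (n, d) and rows of b (m, d)."""
    diff = a[:, None, :] - b[None, :, :]  # shape (n, m, d)
    return np.sqrt((diff ** 2).sum(axis=-1))


data = np.array([[0.0, 0.0],
                 [3.0, 4.0],
                 [6.0, 8.0]])   # n = 3 points
other = np.array([[0.0, 0.0],
                  [3.0, 0.0]])  # m = 2 points of a second data set

dist = pairwise_distances(data, data)         # square matrix, shape (n, n)
map_matrix = pairwise_distances(data, other)  # map matrix, shape (n, m)

print(dist.shape, map_matrix.shape)
```

The square matrix is symmetric with a zero diagonal. The map matrix is in general neither, since its rows and columns refer to different data sets.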

## Neighbourhoods

## Densitygraph