STA 663 Final Project
===
Final Report
---


Section 1:  Basic Information
---
Group members:  Hanyu Song, Azeem Zaman

Paper:  Persistent homology transform for modeling shapes and surfaces

Authors:  Turner, Katharine; Mukherjee, Sayan; Boyer, Doug



Section 2:  Project Abstract
---
### Abstract
Our goal was to develop a package implementing the results of the paper by Turner et al. In particular, we wrote code to compute the persistent homology transform (PHT) of objects in $\mathbb{R}^3$ and shapes in $\mathbb{R}^2$. The PHT is a statistic that completely describes a shape or surface and allows us to define a metric on the space of piecewise linear shapes, which makes it potentially useful for statistical analyses such as clustering.

### Background

The paper introduces a tool that can be used to perform statistical shape analysis on objects in $\mathbb{R}^3$ and shapes in $\mathbb{R}^2$.  The result can be of interest to researchers in topological data analysis (TDA), researchers modeling shapes (for example in medical imaging), and morphologists. One of the paper's authors uses it to compute the distances between heel bones in primates and generate a tree, which can be compared with a tree generated from the genetic distances between primate species.

Section 3:  Code
---
This section contains a general description of each function, including:
1.  A function to read in files containing the data
2.  A function to construct a persistence diagram given a direction
3.  A function to calculate the distance between persistence diagrams
4.  Functions to generate directions for the construction of persistence diagrams
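
As an illustration of item 4, directions in $\mathbb{R}^2$ could be generated as evenly spaced unit vectors on the circle. This is only a minimal sketch: the helper name `circle_directions` is hypothetical and is not necessarily the function used in the package.

```python
import numpy as np

def circle_directions(n):
    """Return n evenly spaced unit vectors in R^2 (hypothetical helper)."""
    # Angles 0, 2*pi/n, ..., 2*pi*(n-1)/n
    angles = 2 * np.pi * np.arange(n) / n
    # Each row is one direction (cos(angle), sin(angle))
    return np.column_stack([np.cos(angles), np.sin(angles)])

directions = circle_directions(8)
```

Each row of `directions` can then be passed as the direction argument when constructing a persistence diagram.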

### a. Module requirements

The following packages are required for implementation: `math`, `multiprocessing`, `numpy`, `scipy`, `glob` and `numba`.


In [8]:
import math
import multiprocessing as mp
import numpy as np
import scipy.io as sio
import glob
from scipy.optimize import linear_sum_assignment
from numba import jit


### b. Functions for reading in Shapes

Two functions for reading in data are included in the package. The first, `read_files`, reads text files containing raw shape data; the second, `read_closed_shapes`, reads Matlab `.mat` files containing closed shape data. Note that each file contains the data of only one shape. Both functions can read several files at once (`read_files` takes a list of file names, `read_closed_shapes` a directory); both return a list with one entry per shape, where the vertices and edges of each shape are stored in two separate `numpy.ndarray`s.

The usage of each function is explained in further detail below:


####   (1)  `read_files(list_files, d)` : Reads in raw shape data files. 

##### Parameters:

`list_files`: A list of text file names. Each file is saved with the raw shape data from one shape. 

`d`: The dimension of the shape, either 2 or 3. 

a. Note that a single dimension parameter is required because we will only compute distances between shapes with the same dimension. It does not make sense to compare objects in $\mathbb{R}^3$ and shapes in $\mathbb{R}^2$.

b. Text files are required to be structured as follows:
1.  The first line should contain two numbers. The first number is the number of vertices in the shape, and the second is the number of edges.
2.  The next lines contain the coordinates of the vertices, one vertex per line. The coordinates should be separated by spaces.
3.  Each of the remaining lines should contain two integers: the (1-based) indices of two vertices that are joined by an edge.

An example file is given below:


    4 4 <- Number of vertices, number of edges
    -1 1 <- vertex 1
    1 1
    1 -1
    -1 -1 <- vertex 4
    1 2 <- edge from 1 to 2
    2 3
    3 4
    4 1 <- edge from 4 to 1

##### Returns:

`list_objects`: A list of lists. Each embedded list contains two `numpy.ndarray`'s: the first array contains coordinates of the vertices of one shape; the second contains the location of the edges of the shape. 

##### Function `read_files`:

In [15]:
def read_files(list_files, d):
	list_objects = []
	for cur_file in list_files:
		with open(cur_file, "r") as f:
			line = f.readline()
			splitline = line.split()
			num_vert = int(splitline[0])
			num_edges = int(splitline[1])

			vertices = np.empty((num_vert, d))
			edges = np.empty((num_edges, 2))

			# read the vertex coordinates (one vertex per line), then the edges

			for i in range(num_vert):
				line = f.readline()
				splitline = line.split()
				numeric_line = [float(x) for x in splitline]
				vertices[i,:] = np.array(numeric_line)
			for i in range(num_edges):
				line = f.readline()
				splitline = line.split()
				numeric_line = [float(x) for x in splitline]
				edges[i,:] = np.array(numeric_line)
			list_objects.append([vertices, edges])
	return(list_objects)

##### Example:

An example of usage is given below. Two text files, `'test_obj'` and `'test_obj2'`, are passed in `list_files`. Each file contains the data of one shape in $\mathbb{R}^2$, hence $d = 2$. 

In [28]:
res = read_files(list_files = ['test_obj','test_obj2'],d = 2)
res

[[array([[-1.,  1.],
         [-1., -1.],
         [ 1.,  1.],
         [ 1., -1.]]), array([[ 1.,  2.],
         [ 1.,  3.],
         [ 3.,  4.],
         [ 2.,  4.]])], [array([[ 0., -1.],
         [ 0.,  0.],
         [ 1.,  1.],
         [ 2.,  0.],
         [ 3.,  0.],
         [ 3.,  1.],
         [ 2., -1.]]), array([[ 1.,  2.],
         [ 2.,  3.],
         [ 3.,  4.],
         [ 4.,  5.],
         [ 5.,  6.],
         [ 7.,  5.]])]]

As can be seen, the function returns a list of two lists. The first embedded list contains two `numpy.ndarray`'s. The first array 

`array([[-1.,  1.],
        [-1., -1.],
        [ 1.,  1.],
        [ 1., -1.]])` 
        
contains the coordinates of the vertices of the shape from the first file, `test_obj`.

The second array lists the edges of the shape: its first row, `[1., 2.]`, indicates that an edge exists between vertices 1 and 2.
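
Because the edge array stores 1-based vertex labels as floats, it has to be converted before it can be used to index the vertex array. The following is a minimal sketch of that conversion using the first shape from the output above; the variable names `edge_idx`, `endpoints` and `edge_lengths` are illustrative and not part of the package.

```python
import numpy as np

# Vertices and edges of the first shape, copied from the output above
vertices = np.array([[-1., 1.], [-1., -1.], [1., 1.], [1., -1.]])
edges = np.array([[1., 2.], [1., 3.], [3., 4.], [2., 4.]])

# Convert the 1-based float labels into 0-based integer indices
edge_idx = edges.astype(int) - 1

# endpoints[k, 0] and endpoints[k, 1] are the coordinates of the two ends of edge k
endpoints = vertices[edge_idx]

# Euclidean length of every edge
edge_lengths = np.linalg.norm(endpoints[:, 0] - endpoints[:, 1], axis=1)
```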

#### (2)  `read_closed_shapes(directory)`: Reads in data of closed shapes saved in `.mat` format. 

This function assumes that the shapes are closed, by which we mean that each vertex is connected to the next vertex (i.e. vertex $n$ is connected to vertex $n+1$) and the last vertex is connected to the first.  This is a very specific function, but the format is a common one in image analysis.  

##### Parameters:

`directory`: Path to the directory where all the relevant `.mat` files are saved.

##### Returns:
`shapes`: A list of lists. Each embedded list contains two `numpy.ndarray`'s: the first array contains the coordinates of the vertices of one shape; the second contains the location of the edges of the shape.

##### Function  `read_closed_shapes`:

In [33]:
def read_closed_shapes(directory):
	"""
	Read all .mat files in the specified directory.
	The path must end with a separator (e.g. './'), since it is concatenated with '*.mat'.
	"""
	query = directory + "*.mat"
	files = glob.glob(query)
	shapes = []
	for file in files:
		vertices = sio.loadmat(file)['x']
		N = vertices.shape[0]
		edges = np.zeros((N,2))
		edges[N-1,:] = np.array([N, 1])
		for i in range(N-1):
			edges[i,:] = np.array([i+1, i+2])
		shapes.append([vertices, edges])
	return shapes
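
The edge list of a closed shape can also be built without an explicit loop. The sketch below is an equivalent vectorized construction, not the package's implementation; the helper name `closed_shape_edges` is hypothetical.

```python
import numpy as np

def closed_shape_edges(n_vertices):
    """Edges (1-based) of a closed polygon: i -> i+1, and the last vertex -> 1."""
    start = np.arange(1, n_vertices + 1)
    # Shifting by one position pairs each vertex with its successor,
    # and wraps the last vertex around to the first
    end = np.roll(start, -1)
    return np.column_stack([start, end]).astype(float)

edges_closed = closed_shape_edges(4)
```

For four vertices this gives the rows `[1, 2]`, `[2, 3]`, `[3, 4]` and `[4, 1]`, the same edges that the loop in `read_closed_shapes` produces.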

##### Example:

The example below demonstrates reading in all the `.mat` files in the current directory. As can be seen, the function returns a list containing one list. The embedded list contains two `numpy.ndarray`s. The first array contains the vertex coordinates of the shape from file `Class1_Sample1.mat`. The second array lists the edges of the shape, e.g., an edge exists between vertices 1 and 2 and between vertices 2 and 3. (Since this is a closed shape, vertex $n$ is connected to vertex $n+1$ for all $n$, and the last vertex is connected to the first.)

In [45]:
res_closed_shp = read_closed_shapes('./')

In [51]:
res_closed_shp

[[array([[  2, 101],
         [  3, 100],
         [  4, 100],
         [  5,  99],
         [  6,  98],
         [  7,  98],
         [  8,  98],
         [  9,  97],
         [  9,  96],
         [ 10,  95],
         [ 10,  94],
         [ 10,  93],
         [ 10,  92],
         [ 10,  91],
         [ 10,  90],
         [ 10,  89],
         [ 10,  88],
         [ 10,  87],
         [ 10,  86],
         [ 11,  85],
         [ 11,  84],
         [ 11,  83],
         [ 11,  82],
         [ 11,  81],
         [ 11,  80],
         [ 11,  79],
         [ 11,  78],
         [ 11,  77],
         [ 11,  76],
         [ 12,  75],
         [ 13,  76],
         [ 14,  75],
         [ 15,  74],
         [ 14,  73],
         [ 13,  72],
         [ 14,  71],
         [ 15,  70],
         [ 16,  69],
         [ 17,  68],
         [ 18,  67],
         [ 19,  66],
         [ 19,  65],
         [ 19,  64],
         [ 20,  64],
         [ 21,  64],
         [ 22,  63],
         [ 23,  62],
         [ 23

### c. A function for persistence diagram construction 

A function to construct a persistence diagram given a direction is included in the package. 
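
The first step of such a construction is to assign a filtration value to every vertex and edge in the chosen direction. The sketch below assumes the height function $h_v(x) = x \cdot v$, with an edge entering the filtration when its higher endpoint does. The function name `height_filtration` is hypothetical, and the edge array is assumed to use the 1-based labels returned by the reading functions above.

```python
import numpy as np

def height_filtration(vertices, edges, direction):
    """Filtration values of vertices and edges for the height function in `direction`."""
    direction = np.asarray(direction, dtype=float)
    direction = direction / np.linalg.norm(direction)  # normalize to a unit vector
    # Height of a vertex is its inner product with the direction
    vertex_heights = np.asarray(vertices, dtype=float) @ direction
    # Edges are stored with 1-based labels; convert to 0-based indices
    edge_idx = np.asarray(edges).astype(int) - 1
    # An edge appears once both endpoints are present, i.e. at the larger height
    edge_heights = vertex_heights[edge_idx].max(axis=1)
    return vertex_heights, edge_heights

# Square from the earlier example, filtered in the direction (0, 1)
square_vertices = np.array([[-1., 1.], [-1., -1.], [1., 1.], [1., -1.]])
square_edges = np.array([[1., 2.], [1., 3.], [3., 4.], [2., 4.]])
v_heights, e_heights = height_filtration(square_vertices, square_edges, [0.0, 1.0])
```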


Algorithms
---
One algorithm used is the Hungarian (or Munkres) algorithm.  It solves assignment problems, in which each possible assignment has an associated cost and the goal is to select the assignment that minimizes the total cost.  We use it to calculate the distance between persistence diagrams.  This distance is the sum of the distances between paired points, where each point of one diagram is paired either with a point of the other diagram or with an additional point on the diagonal.  Selecting the pairing that minimizes this sum can be achieved with the Munkres algorithm.
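
The pairing described above can be sketched with `scipy.optimize.linear_sum_assignment`, which is already imported in this report. This is a minimal sketch under stated assumptions, not necessarily the package's implementation: the function name `diagram_distance` is hypothetical, the diagrams are assumed to contain only finite points, and the Euclidean ground distance is an assumption that can be swapped for another norm.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def diagram_distance(X, Y):
    """Sum of distances under the optimal pairing of two persistence diagrams.

    X, Y: sequences of (birth, death) pairs with finite values.
    """
    X = np.asarray(X, dtype=float).reshape(-1, 2)
    Y = np.asarray(Y, dtype=float).reshape(-1, 2)
    n, m = len(X), len(Y)
    if n + m == 0:
        return 0.0
    # Square cost matrix: each diagram is padded with diagonal slots
    cost = np.zeros((n + m, n + m))
    # Cost of pairing a point of X with a point of Y
    cost[:n, :m] = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)
    # Cost of pairing a point of X with the diagonal (Euclidean distance to it)
    cost[:n, m:] = (np.abs(X[:, 1] - X[:, 0]) / np.sqrt(2))[:, None]
    # Cost of pairing a point of Y with the diagonal
    cost[n:, :m] = (np.abs(Y[:, 1] - Y[:, 0]) / np.sqrt(2))[None, :]
    # Diagonal-to-diagonal pairings (bottom-right block) cost nothing
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].sum()
```

Padding both diagrams with diagonal slots makes the cost matrix square, so diagrams with different numbers of points can still be compared.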

Another algorithm used in our code is the Union-Find algorithm.  It is used in the construction of the persistence diagrams, during which we must keep track of when disjoint components merge.  We view each component as a tree; when two components merge, we join the roots of the two trees.  This allows us to detect when disjoint components merge.
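
The merging procedure can be sketched as follows for the connected components (dimension 0) of a shape. This is an illustrative sketch, not the package's code: the names `find` and `zero_dim_persistence` are hypothetical, edges are assumed to be given as 0-based index pairs, and we assume the usual convention that the younger component dies when two components merge.

```python
def find(parent, i):
    """Return the root of the tree containing i, compressing the path on the way."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

def zero_dim_persistence(vertex_heights, edges):
    """(birth, death) pairs of connected components; edges are 0-based (u, v) pairs."""
    n = len(vertex_heights)
    parent = list(range(n))       # every vertex starts as its own tree
    birth = list(vertex_heights)  # birth[r] is the birth height of the component rooted at r

    def edge_height(e):
        # An edge enters when its higher endpoint does
        return max(vertex_heights[e[0]], vertex_heights[e[1]])

    pairs = []
    for u, v in sorted(edges, key=edge_height):
        ru, rv = find(parent, u), find(parent, v)
        if ru == rv:
            continue  # both endpoints already connected: the edge closes a loop
        if birth[ru] > birth[rv]:
            ru, rv = rv, ru  # make ru the older component
        height = edge_height((u, v))
        if birth[rv] < height:
            pairs.append((birth[rv], height))  # the younger component dies here
        parent[rv] = ru  # join the roots of the two trees
    # Components that never merge persist forever
    roots = sorted({find(parent, i) for i in range(n)})
    pairs.extend((birth[r], float("inf")) for r in roots)
    return pairs

# A path whose middle vertex is the highest: two components merge at height 2
pairs = zero_dim_persistence([0.0, 2.0, 1.0], [(0, 1), (1, 2)])
```

In this example the component born at height 1 dies at height 2, when the highest vertex connects it to the component born at height 0, which never dies.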

Section 4:  Tests
---
Write some tests. In particular, compare the results from our code to the results in the paper to ensure that they agree.  Test on simple simulated data.

Section 5:  Optimization
---
This section will describe the steps taken to optimize the speed of the code using methods such as just-in-time compilation, Cython, and possibly alternative algorithms.

Section 6:  Packaging
---
Prepare the GitHub repo for distribution.  Prof. Mukherjee expressed interest in having the code wrapped for use in R.  If we have time, we will work on this.  