STA 663 Final Project
===
Final Report
---


Section 1:  Basic Information
---
Group members:  Hanyu Song, Azeem Zaman

Paper:  Persistent homology transform for modeling shapes and surfaces

Authors:  Turner, Katharine; Mukherjee, Sayan; Boyer, Doug



Section 2:  Project Abstract
---
### Abstract
Our goal was to develop a package that implements the results of the paper by Turner et al. In particular, we wrote code to compute the persistent homology transform (PHT) of objects in $\mathbb{R}^3$ and shapes in $\mathbb{R}^2$. The PHT is a statistic that completely describes a shape or surface and allows us to define a metric on the space of piecewise linear shapes, which makes it potentially useful for statistical analyses such as clustering.

### Background

The paper introduces a tool that can be used to perform statistical shape analysis on objects in $\mathbb{R}^3$ and shapes in $\mathbb{R}^2$.  The result can be of interest to researchers in topological data analysis (TDA), researchers modeling shapes (for example, in medical imaging), and morphologists. One of the paper's authors uses this tool to compute the distances between heel bones in primates and generate a tree, which can be compared with a tree generated from the genetic distances between primate species.

Section 3:  Code
---
This section contains a general description of each function, including:
1.  A function to read in files containing the data
2.  A function to construct a persistence diagram given a direction
3.  A function to calculate the distance between persistence diagrams
4.  Functions to generate directions for the construction of persistence diagrams

The following packages are required for implementation: `math`, `multiprocessing`, `numpy`, `scipy`, `glob` and `numba`.


    import math
    import multiprocessing as mp
    import numpy as np
    import scipy.io as sio
    import glob
    from scipy.optimize import linear_sum_assignment
    from numba import jit



Functions for reading in Shapes
---
There are two functions for reading in data included in the package.  The first, `read_mesh_graph`, is for reading in raw shape files.  The input files should be text files, structured as follows:
1.  The first line should contain two numbers.  The first number is the number of vertices in the shape, the second is the number of edges.
2.  The next lines contain the coordinates of the vertices, one vertex per line.  The coordinates should be separated by spaces.
3.  The remaining lines each contain two integers, the indices of two vertices that have an edge between them.  As in the example below, vertices are numbered starting from 1.

An example of such a file is given below.

    4 4 <- Number of vertices, number of edges
    -1 1 <- vertex 1
    1 1
    1 -1
    -1 -1 <- vertex 4
    1 2 <- edge from 1 to 2
    2 3
    3 4
    4 1 <- edge from 4 to 1
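
As an illustration of the format above, here is a minimal sketch of a parser. It is not the package's `read_mesh_graph`; the name `parse_mesh_graph` is hypothetical, and the sketch assumes that the file contains no `<-` annotations (those appear in the example only as explanation) and that edge indices are 1-based, so it converts them to 0-based indices.

```python
import numpy as np

def parse_mesh_graph(text):
    """Parse the text of a shape file into (vertices, edges).

    Hypothetical helper: assumes no annotations in the file and
    1-based vertex indices on the edge lines.
    """
    # Split every non-empty line into whitespace-separated tokens.
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    n_vertices, n_edges = int(lines[0][0]), int(lines[0][1])

    # Vertex coordinates follow the header, one vertex per line.
    vertex_lines = lines[1:1 + n_vertices]
    vertices = np.array([[float(x) for x in ln] for ln in vertex_lines])

    # Edge lines follow the vertices; shift indices to start at 0.
    edge_lines = lines[1 + n_vertices:1 + n_vertices + n_edges]
    edges = np.array([[int(i) - 1 for i in ln] for ln in edge_lines], dtype=int)
    return vertices, edges

example = "4 4\n-1 1\n1 1\n1 -1\n-1 -1\n1 2\n2 3\n3 4\n4 1\n"
vertices, edges = parse_mesh_graph(example)
```

For the square in the example, `vertices` has shape `(4, 2)` and `edges` contains the four index pairs of the cycle.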

The other function, `read_closed_shape`, is used to read Matlab `.mat` files.  It reads all `.mat` files in a specified directory.  This function assumes that the shapes are closed, by which we mean that each vertex is connected to the next vertex (vertex $n$ is connected to vertex $n+1$) and the last vertex is connected to the first.  This is a very specific function, but the format is a common one in image analysis.
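
The closed-shape assumption means the edge list never has to be stored, since it follows from the number of vertices. A minimal sketch of that construction is below; `closed_shape_edges` is a hypothetical name, not a function from the package, and it uses 0-based indices.

```python
import numpy as np

def closed_shape_edges(n_vertices):
    """Edges of a closed shape: vertex i joins vertex i + 1,
    and the last vertex joins the first (hypothetical helper)."""
    idx = np.arange(n_vertices)
    # The modulus wraps the last vertex back around to vertex 0.
    return np.column_stack((idx, (idx + 1) % n_vertices))

square_edges = closed_shape_edges(4)
```

For four vertices this gives the pairs (0, 1), (1, 2), (2, 3) and (3, 0).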

Algorithms
---
One algorithm used is the Hungarian (or Munkres) algorithm.  The algorithm solves assignment problems, in which every possible assignment has an associated cost and the goal is to select the assignment that minimizes the total cost.  We use it to calculate the distance between persistence diagrams.  The distance between two persistence diagrams is the sum of the distances between paired points, where each point of one diagram is paired either with a point of the other diagram or with a point on the diagonal.  Selecting the pairing that minimizes this sum can be achieved with the Munkres algorithm.
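
The pairing problem can be sketched with `scipy.optimize.linear_sum_assignment`, which solves the same assignment problem. This is an illustrative sketch rather than the package's implementation: the name `diagram_distance` is hypothetical, the use of the Euclidean norm is an assumption (the paper's choice of norm may differ), and only points with finite death times are handled.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def diagram_distance(X, Y):
    """Minimum-cost pairing between two persistence diagrams.

    X and Y are sequences of (birth, death) pairs. Hypothetical helper;
    Euclidean costs are an assumption.
    """
    X = np.asarray(X, dtype=float).reshape(-1, 2)
    Y = np.asarray(Y, dtype=float).reshape(-1, 2)
    n, m = len(X), len(Y)
    if n + m == 0:
        return 0.0

    # Rows: n points of X, then m diagonal slots.
    # Columns: m points of Y, then n diagonal slots.
    cost = np.zeros((n + m, n + m))
    # Cost of pairing a point of X with a point of Y.
    cost[:n, :m] = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)
    # Cost of pairing a point with its nearest point on the diagonal.
    cost[:n, m:] = (np.abs(X[:, 1] - X[:, 0]) / np.sqrt(2))[:, None]
    cost[n:, :m] = (np.abs(Y[:, 1] - Y[:, 0]) / np.sqrt(2))[None, :]
    # Pairing a diagonal slot with a diagonal slot is free (already zero).

    rows, cols = linear_sum_assignment(cost)
    return float(cost[rows, cols].sum())

d_pair = diagram_distance([(0, 1)], [(0, 3)])   # pairs the two points directly
d_diag = diagram_distance([(0, 2)], [])         # pairs the point with the diagonal
```

Padding both diagrams with diagonal slots makes the cost matrix square, so diagrams with different numbers of points can still be compared.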

Another algorithm used in our code is the Union-Find algorithm.  This algorithm is used in the construction of the persistence diagrams.  During the construction we must keep track of when disjoint components merge.  We view each component as a tree.  When two components merge we join the roots of the trees.  Comparing roots tells us whether two vertices already belong to the same component, and therefore when an edge merges two disjoint components.
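
To make the role of Union-Find concrete, here is a minimal sketch of computing the connected-component (dimension 0) persistence pairs of a graph filtered by height in a given direction. It is not the package's code: the names `find` and `zero_dim_persistence` are hypothetical, vertices are indexed from 0, and the sketch assumes the usual convention that when two components merge, the one born later dies.

```python
import math
import numpy as np

def find(parent, i):
    """Return the root of i's tree, shortening the path along the way."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

def zero_dim_persistence(vertices, edges, direction):
    """(birth, death) pairs of components in the height filtration."""
    heights = np.asarray(vertices, dtype=float) @ np.asarray(direction, dtype=float)
    n = len(heights)
    neighbors = [[] for _ in range(n)]
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)

    parent = list(range(n))   # every vertex starts as its own root
    added = [False] * n
    pairs = []
    # Sweep the vertices from lowest to highest.
    for v in np.argsort(heights, kind="stable"):
        v = int(v)
        added[v] = True
        for u in neighbors[v]:
            if not added[u]:
                continue
            ru, rv = find(parent, u), find(parent, v)
            if ru == rv:
                continue
            # A root is the lowest vertex of its component, so its
            # height is the component's birth. The younger one dies.
            old, young = (ru, rv) if heights[ru] <= heights[rv] else (rv, ru)
            if heights[young] < heights[v]:
                pairs.append((float(heights[young]), float(heights[v])))
            parent[young] = old
    # Components that never merge persist forever.
    roots = sorted({find(parent, i) for i in range(n)})
    pairs.extend((float(heights[r]), math.inf) for r in roots)
    return pairs

# A "W"-shaped path: two local minima that merge at height 0.5.
w_shape = [(0, 1), (1, 0), (2, 0.5), (3, 0.2), (4, 1)]
w_edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
w_pairs = zero_dim_persistence(w_shape, w_edges, (0, 1))
```

Here `w_pairs` is `[(0.2, 0.5), (0.0, inf)]`: the component born at the shallower minimum dies when the two valleys join, and the component born at the lowest point never dies. Pairs with equal birth and death are omitted.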

Section 4:  Tests
---
Write some tests. In particular, compare results from our code to results in the paper to ensure that they agree.  Test on simple simulated data.

Section 5:  Optimization
---
This section will describe the steps taken to optimize the speed of the code using methods such as just-in-time compilation, Cython, and possibly alternative algorithms.

Section 6:  Packaging
---
Prepare the GitHub repo for distribution.  Prof. Mukherjee expressed interest in having the code wrapped for use in R.  If we have time, we will work on this.