## Handling `pK.out` files in Python
The `pK.out` files generated by MCCE are text files, but they also contain non-numeric data, so a parser function has to account for such oddities. Furthermore, we are often interested in comparing multiple `pK.out` files.

In the following, we write a simple function that returns a `NumPy` matrix consisting of all the data present in `pK.out`, replacing non-numeric text with `NaN`. The benefit of this approach is that `NumPy` can still perform mathematical operations on matrices that contain `NaN` values: element-wise arithmetic propagates `NaN`, and NaN-aware functions such as `np.nanmean` ignore those entries.
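To make the `NaN` behaviour concrete, here is a minimal sketch on a small made-up array (the values are illustrative and not taken from any `pK.out` file). Plain reductions such as `np.mean` propagate `NaN`, whereas the `nan`-prefixed variants skip it.

```python
import numpy as np

# Hypothetical 2 x 3 matrix with one missing (NaN) entry
m = np.array([[1.0, 2.0, 3.0],
              [np.nan, 4.0, 5.0]])

plain_mean = np.mean(m, axis=0)         # NaN propagates into the first column
nan_aware_mean = np.nanmean(m, axis=0)  # NaN entries are skipped

print(plain_mean)
print(nan_aware_mean)
```

Only the first column differs between the two results, because it is the only one that contains a `NaN` entry.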


In [5]:
import os
import numpy as np


def get_data_matrix(pkout_file):
    """
    Turn a pK.out file into a numpy matrix.

    Parameters
    ----------
    pkout_file : string
        A string containing a valid filename for the file to be processed.

    Returns
    -------
    residue_data : numpy.ndarray
        An m x n matrix where m is the number of residues and n is the number
        of columns in pK.out (minus 1, since the first column is skipped).
    """
    if not os.path.isfile(pkout_file):
        raise IOError("File not found: %s" % pkout_file)

    with open(pkout_file, "r") as pkout:
        all_lines = pkout.readlines()
    # the first line is the header; the first column is skipped
    header = all_lines[0].split()[1:]
    data = all_lines[1:]
    num_res = len(data)
    num_cols = len(header)
    # create empty data matrix
    residue_data = np.zeros([num_res, num_cols])
    # populate matrix
    for i, line in enumerate(data):
        res_data = line.split()[1:]
        if any(flag in res_data[0] for flag in "<>lh"):
            # the first field is not a plain number: set the first three
            # columns to NaN and convert the remaining fields as usual
            residue_data[i, 0:3] = np.nan
            for j, value in enumerate(res_data[3:]):
                residue_data[i, j + 3] = float(value)
        else:
            for j, value in enumerate(res_data):
                residue_data[i, j] = float(value)
    return residue_data
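
An alternative to testing the first field for specific characters is to attempt the conversion and fall back to `NaN` when it fails. The sketch below is only an illustration of that idea: `to_float_or_nan` is a hypothetical helper, and the example fields are made up rather than taken from a real `pK.out` file. Unlike the function above, it converts each token independently and does not blank out neighbouring columns.

```python
import numpy as np


def to_float_or_nan(token):
    """Hypothetical helper: convert a text field to float, or NaN if it is not numeric."""
    try:
        return float(token)
    except ValueError:
        return np.nan


# Illustrative fields, not taken from a real pK.out file
fields = ["4.25", ">14.0", "0.95", "<0.0"]
values = np.array([to_float_or_nan(f) for f in fields])
print(values)
```

Whether this is preferable depends on how the non-numeric rows are laid out in the files at hand.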


Let's now put this function to work by using a sample `pK.out` file. This is what the file looks like.

In [None]:
!more pK.out

In [4]:
pkout_data = get_data_matrix("pK.out")
print(pkout_data)

[[  1.79140000e+01   1.91000000e+00   2.60000000e-01  -0.00000000e+00
   -1.20000000e-01   0.00000000e+00   9.60000000e-01   1.26000000e+00
    0.00000000e+00  -2.00000000e+00   0.00000000e+00   0.00000000e+00
   -4.36000000e+00  -4.26000000e+00]
 [  2.33280000e+01   1.94400000e+00   5.20000000e-02   0.00000000e+00
    0.00000000e+00   0.00000000e+00   9.20000000e-01   1.76000000e+00
    0.00000000e+00   6.40000000e+00   0.00000000e+00   4.60000000e-01
   -1.20000000e-01   9.42000000e+00]
 [  2.10020000e+01   1.92400000e+00   4.00000000e-03  -8.00000000e-02
   -2.00000000e-02   0.00000000e+00  -6.00000000e-02   2.00000000e-02
    0.00000000e+00  -6.80000000e+00   0.00000000e+00   0.00000000e+00
    4.00000000e-02  -6.88000000e+00]
 [             nan              nan              nan  -2.36000000e+00
   -4.88000000e+00   4.52000000e+00   0.00000000e+00  -4.50000000e+00
    0.00000000e+00   5.80000000e-01  -7.66000000e+00  -1.41800000e+01
    0.00000000e+00   0.00000000e+00]
 [  7.500000

As you can see, this is just a regular `numpy` matrix. You can add or subtract two such matrices, or average over columns or rows (using NaN-aware functions such as `np.nanmean` wherever `NaN` entries are present).
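As a rough sketch of such a comparison, the snippet below subtracts two small made-up matrices that stand in for the output of `get_data_matrix` on two different `pK.out` files. The names `run_a` and `run_b` and all values are hypothetical.

```python
import numpy as np

# Two hypothetical 2 x 3 matrices standing in for parsed pK.out files
run_a = np.array([[4.0, 1.0, 0.2],
                  [np.nan, 0.9, 0.1]])
run_b = np.array([[4.5, 1.0, 0.4],
                  [10.0, 0.8, 0.1]])

diff = run_b - run_a                      # NaN wherever either input is NaN
col_mean_diff = np.nanmean(diff, axis=0)  # column averages, skipping NaN
n_undefined = int(np.isnan(diff).sum())   # entries that could not be compared

print(diff)
print(col_mean_diff)
print(n_undefined)
```

Counting the `NaN` entries alongside the averages helps keep track of how many residues were left out of each comparison.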