Latest Update Date: 2019 Feb.
This package helps users calculate correlation coefficients and the covariance matrix of data with missing values. Both calculations depend on the standard deviation of the data, but real-world data is not always clean and tidy. Python's numpy returns nan instead of a usable standard deviation or correlation coefficient when the data has missing values. This package aims to overcome this obstacle by handling missing values during these calculations. CorrPy uses the listwise deletion method to handle missing values: it removes the rows of a data frame where missing values are present.
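As an illustration of the row-removal idea described above, here is a minimal sketch written with plain numpy. It is not CorrPy's own code; the variable names and the sample data are assumptions made for the example.

```python
import numpy as np

# Two variables observed on five rows; np.nan marks a missing value.
x = np.array([1.0, 2.0, np.nan, 4.0, 5.0])
y = np.array([-6.0, -7.0, -8.0, 9.0, 1.0])

# Stack the variables as columns so that each row is one observation.
data = np.column_stack([x, y])

# Keep only the rows in which every variable is present.
complete_rows = ~np.isnan(data).any(axis=1)
clean = data[complete_rows]

print(clean.shape)  # (4, 2): the row containing the missing value is gone
print(clean[:, 0])  # the remaining x values
```

Note that the whole row is dropped, so the valid y value on that row is discarded as well.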
Note: If the course timeline permits, CorrPy will also handle missing values via single imputation with the mean value: replacing the missing values with the mean of the existing values.
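The mean-replacement idea in the note could look roughly like the sketch below. The helper name `impute_mean` is hypothetical and is not part of CorrPy; only standard numpy calls are used.

```python
import numpy as np

def impute_mean(values):
    """Hypothetical helper: replace NaN entries with the mean of the observed entries."""
    arr = np.asarray(values, dtype=float)
    observed_mean = np.nanmean(arr)  # mean computed over the non-NaN entries only
    return np.where(np.isnan(arr), observed_mean, arr)

filled = impute_mean([1, 2, np.nan, 4, 5])
print(filled)  # the missing entry is replaced by 3.0
```

Unlike row removal, this keeps every row, but the filled-in values sit exactly at the mean, so the spread of the imputed data is smaller than the spread of the observed values.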
Name | Slack Handle | Github.com | Link |
---|---|---|---|
KERA YUCEL | @KERA YUCEL | @K3ra-y | Kera's link |
GOPALAKRISHNAN ANDIVEL | @Krish | @Gopsathvik | Krish's link |
WEISHUN DENG | @Wilson Deng | @xiaoweideng | Wilson's link |
Mengda Yu | @Mengda(Albert) Yu | @mru4913 | Albert's link |
CorrPy can be installed with pip in a command window:
pip install git+https://github.com/UBC-MDS/CorrPy.git
To test branch coverage, we use coverage.py. You can install it with pip install coverage.
We also created a Makefile to automate the process. Run the following to see the branch coverage report.
make report_branch
The results are shown below.
Name                            Stmts   Miss Branch BrPart  Cover   Missing
---------------------------------------------------------------------------
CorrPy/__init__.py                  4      0      0      0   100%
CorrPy/corr_plus.py                26      0     12      0   100%
CorrPy/cov_mx.py                   20      0      8      0   100%
CorrPy/std_plus.py                 15      0      8      0   100%
CorrPy/test/__init__.py             0      0      0      0   100%
CorrPy/test/test_corr_plus.py      41      0      0      0   100%
CorrPy/test/test_cov_mx.py         45      0      0      0   100%
CorrPy/test/test_std_plus.py       35      0      0      0   100%
---------------------------------------------------------------------------
To run all the tests, we use pytest via make test_all.
The results are shown below.
Standard deviation measures how far the data points spread from the mean, which gives insight into the variation of the data. This function automatically handles missing values in the input, so std_plus removes that frustration from workflows.
>>> import numpy as np
>>> from CorrPy import std_plus
>>> x = [1, 2, np.nan, 4, np.nan, 6]
>>> std_plus(x)
array([1.920286436967152])
>>> y = [1, 2, np.inf, 4, np.nan, 6, "a"]
>>> std_plus(y)
array([1.920286436967152])
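To make the filtering step concrete, the following is a minimal sketch of a standard deviation that skips unusable entries. It is not std_plus itself: the helper name `std_ignoring_missing` is hypothetical, and the choices to use the population formula (dividing by n) and to skip booleans are assumptions of the sketch.

```python
import math
import numbers

def std_ignoring_missing(values):
    """Hypothetical helper: population standard deviation of the finite numeric entries."""
    kept = [
        float(v)
        for v in values
        # Keep real numbers only; drop strings, booleans (an assumption), NaN and infinities.
        if isinstance(v, numbers.Real) and not isinstance(v, bool) and math.isfinite(v)
    ]
    if not kept:
        raise ValueError("no finite numeric values to summarise")
    mean = sum(kept) / len(kept)
    variance = sum((v - mean) ** 2 for v in kept) / len(kept)  # divide by n, not n - 1
    return math.sqrt(variance)

result = std_ignoring_missing([1, 2, float("inf"), 4, float("nan"), 6, "a"])
print(round(result, 6))  # 1.920286
```

Only 1, 2, 4 and 6 survive the filter, which is why the sketch lands on the same value as the examples above.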
The correlation coefficient measures the direction of the relationship between two variables as well as its magnitude. This function automatically handles missing values in the input.
>>> import numpy as np
>>> from CorrPy import corr_plus
>>> x = [1, 2, np.nan, 4, 5]
>>> y = [-6, -7, -8, 9, True]
>>> corr_plus(x, y)
array([0.7391090892601785])
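The calculation can be sketched as a Pearson correlation computed only over the pairs in which both values are present. The helper name `corr_complete_pairs` is hypothetical and this is not corr_plus itself; the sketch also writes the boolean entry of the example above as the number 1, which is an assumption.

```python
import numpy as np

def corr_complete_pairs(x, y):
    """Hypothetical helper: Pearson correlation over pairs where both values are present."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.shape != y.shape:
        raise ValueError("x and y must have the same length")
    # Drop every pair in which either value is missing.
    keep = ~(np.isnan(x) | np.isnan(y))
    x, y = x[keep], y[keep]
    # Pearson's r: co-variation divided by the product of the individual spreads.
    dx, dy = x - x.mean(), y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

r = corr_complete_pairs([1, 2, np.nan, 4, 5], [-6, -7, -8, 9, 1])
print(round(r, 4))  # 0.7391
```

The third pair is removed as a whole, so the -8 in y is discarded along with the missing x value.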
A covariance matrix displays the variances and covariances together. The diagonal elements represent the variances and the off-diagonal elements represent the covariances, as in the matrix shown below. This function uses the above two functions.
>>> import numpy as np
>>> from CorrPy import cov_mx
>>> x = [1, 2, np.nan, 4, 5]
>>> y = [-6, -7, -8, 9, True]
>>> cov_mx([x, y])
array([[ 2.33333333, 12.66666667],
       [12.66666667, 80.33333333]])
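For readers who want to see the structure of such a matrix, here is a minimal sketch that removes incomplete observations and then calls numpy's own np.cov. The helper name `cov_complete_rows` is hypothetical, the sample formula (dividing by n - 1) is an assumption, and the input below leaves out the boolean entry of the example above because the sketch makes no claim about how CorrPy treats it.

```python
import numpy as np

def cov_complete_rows(variables):
    """Hypothetical helper: sample covariance matrix after listwise deletion."""
    data = np.asarray(variables, dtype=float)  # shape: (n_variables, n_observations)
    keep = ~np.isnan(data).any(axis=0)         # observations present in every variable
    return np.cov(data[:, keep], ddof=1)       # ddof=1 divides by n - 1

m = cov_complete_rows([[1, 2, np.nan, 4], [-6, -7, -8, 9]])
print(m)        # 2 x 2 matrix
print(m[0, 0])  # variance of the first variable
print(m[0, 1])  # covariance between the two variables
```

On these inputs the result agrees with the matrix printed above to the displayed precision; whether CorrPy arrives at it the same way is not something the sketch establishes.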
The following functions already exist in the Python ecosystem. However, they do not handle missing values, so the CorrPy package implements the calculation of standard deviation, correlation coefficients and the covariance matrix with missing values handled.
Python Standard Deviation: https://docs.scipy.org/doc/numpy-1.14.2/reference/generated/numpy.std.html
Python Correlation Coefficients: https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.corrcoef.html
Python Covariance Matrix: https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.cov.html
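A short check of the three numpy functions linked above shows how a single missing value propagates into their results. The sample data is an assumption made for the example.

```python
import numpy as np

x = np.array([1.0, 2.0, np.nan, 4.0, 5.0])
y = np.array([-6.0, -7.0, -8.0, 9.0, 1.0])

std_x = np.std(x)          # nan: the missing value propagates
corr = np.corrcoef(x, y)   # every entry that involves x is nan
cov = np.cov(x, y)         # every entry that involves x is nan

print(std_x)
print(corr)
print(cov)
```

Only the entries that depend on y alone stay finite, which is the gap the package sets out to fill.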
Milestone | Tasks |
---|---|
Milestone 1 | Proposal |
Milestone 2 | Function Code, Test Code |