# Assembly and demonstration of Kathode's rsMRI QC package

##### This project aims to produce a python package that is capable of taking a csv spreadsheet of information about a large number of resting state MRI scans and performing automated QC to determine which scans are usable and which should be excluded from future analyses. 

## Setting up the package

In [4]:
package_name = "kathodes_package"
%mv package_name/ kathodes_package

In [5]:
from pathlib import Path
python_dir = Path(package_name)
(python_dir / '__init__.py').touch()
Path('setup.py').touch()
Path('LICENSE').touch()
Path('README.md').touch()

### Adding setup.py

In [6]:
%%writefile setup.py
import setuptools

with open("README.md", "r") as fh:
    long_description = fh.read()

setuptools.setup(
    name="kathodes_package", 
    version="0.0.1",
    author="Katherine Soderberg",
    author_email="katherine.soderberg@nih.gov",
    description="A small example package",
    long_description=long_description,
    long_description_content_type="text/markdown",
    url="https://github.com/pypa/packaging_demo",
    packages=setuptools.find_packages(),
    classifiers=[
        "Programming Language :: Python :: 3",
        "License :: OSI Approved :: MIT License",
        "Operating System :: OS Independent",
    ],
    python_requires='>=3.6',

)

Overwriting setup.py


### Creating README file

In [1]:
%%writefile README.md
# Example Package

This is a package that can read in a table of data describing resting MRI scans and filter 
by specific scanner metrics to perform automated quality control. It produces information
about which scans are high quality enough to proceed with processing. 

Overwriting README.md


### Creating LICENSE file

In [6]:
%%writefile LICENSE
Copyright (c) 2018 The Python Packaging Authority

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

Overwriting LICENSE


## Coding the content of the package

### Creating filesummary module

In [3]:
%%writefile kathodes_package/filesummary.py

def summarize(filename):
    """Return basic information about the contents of a csv file with the rows as scan instances."""
    import pandas as pd
    if filename.endswith('.csv'):
        loaded_file = pd.read_csv(filename)
        print("The head of the file is:" + "\n" + str(loaded_file.head()))
        print("Summary information about the file is as follows:")
        print(loaded_file.info())
        number_of_scans = str(len(loaded_file.index))
        print("The file contains information about " + number_of_scans + " scans.")
        closing_statement = filename + " has been summarized."
        return closing_statement
    else: 
        print("The file provided is not in comma separated value format. Please convert your input file to .csv format before using this package.")
        
def listColumns(filename):
    import pandas as pd
    loaded_file = pd.read_csv(filename)
    colNames_list = loaded_file.columns
    return colNames_list


Overwriting kathodes_package/filesummary.py


### Writing test for filesummary to ensure correct file type

In [2]:
%%writefile tests/test_input.py

def test_input_is_csv():
    from kathodes_package import filesummary
    filename = 'sample.csv'
    output = filesummary.summarize(filename)
    assert output == "sample.csv has been summarized."

Overwriting tests/test_input.py


In [2]:
!pytest

platform darwin -- Python 3.7.4, pytest-5.2.1, py-1.8.0, pluggy-0.13.0
rootdir: /Users/katherinesoderberg/Documents/PythonClass/project_spring_2020
plugins: arraydiff-0.3, remotedata-0.3.2, doctestplus-0.4.0, openfiles-0.4.0
collected 2 items                                                              [0m

tests/sample_test.py [32m.[0m[36m                                                   [ 50%][0m
tests/test_input.py [32m.[0m[36m                                                    [100%][0m



### Creating data cleaning module

In [2]:
%%writefile kathodes_package/data_cleaning.py

def removeMissing(filename):
    """Takes a file that contains missing scans and removes those rows, while providing the subject name and reason for removal."""
    import pandas as pd
    import math
    loaded_file = pd.read_csv(filename)
    cleaned_list = []
    missing_counter = 0
    for row in loaded_file.index:
        if math.isnan(loaded_file.iloc[row, 3]):
            print("Dropping subject scan " + loaded_file.iloc[row, 0] + " because of " + loaded_file.iloc[row,1])
            missing_counter = missing_counter + 1
        else:
            cleaned_list.append(loaded_file.iloc[row])
    print("There were " + str(missing_counter) + " scans with missing data dropped.")
    cleaned_df = pd.DataFrame(cleaned_list)
    return cleaned_df

def voxelConsistency(cleaned_dataframe, column_number, expected_size):
    """Checks that every scan has the same voxel dimension, specified by the user."""
    import pandas as pd
    consistency_boolean = True
    for row in range(cleaned_dataframe.index.size - 1):
        if cleaned_dataframe.iloc[row, column_number] != expected_size:
            print("Subject scan " + cleaned_dataframe.iloc[row, 0] + " with row index of " + str(row) + " does not have voxel size of " + str(expected_size))
            consistency_boolean = False
        else:
            continue
    return consistency_boolean


Overwriting kathodes_package/data_cleaning.py


### Creating outlier assessment module

In [5]:
%%writefile kathodes_package/outlier_assessment.py

def outlierStats(outlier_list):
    """Takes a list of outliers and computes the mean and standard deviation"""
    import statistics
    try:
        outlierMean = statistics.mean(outlier_list)
        outlierStdev = statistics.stdev(outlier_list)
        return outlierMean, outlierStdev
    except TypeError :
        explanation = "Cannot compute statistics on a list of non-numerical elements."
        return explanation
    
def outlierExclude(cleaned_dataframe, column_number, stdev_cutoff_factor):
    """Uses outlierStats to determine which scans have outlying volumes above a specified threshold and removes them."""
    import pandas as pd
    column_as_series = cleaned_dataframe.iloc[:,column_number]
    column_as_list = column_as_series.tolist()
    mean, stdev = outlierStats(column_as_list)
    upper_threshold = mean + (stdev * stdev_cutoff_factor)
    lower_threshold = mean - (stdev * stdev_cutoff_factor)
    noOutlier_list = []
    for row in range(cleaned_dataframe.index.size - 1):
        if (cleaned_dataframe.iloc[row, column_number] > upper_threshold) or (cleaned_dataframe.iloc[row, column_number] < lower_threshold):
            print("Dropping subject scan " + cleaned_dataframe.iloc[row, 0] + " due to " + str(cleaned_dataframe.iloc[row, column_number]) + " outlying volumes." )
        else:
            noOutlier_list.append(cleaned_dataframe.iloc[row])
    noOutlier_dataframe = pd.DataFrame(noOutlier_list)
    return noOutlier_dataframe

Overwriting kathodes_package/outlier_assessment.py


In [None]:

# see http://katyhuff.github.io/python-testing/03-exceptions/
def mean(num_list):
    try:
        return sum(num_list)/len(num_list)
    except ZeroDivisionError :
        return 0
    except TypeError as detail :
        msg = "The algebraic mean of an non-numerical list is undefined.\
               Please provide a list of numbers."
        raise TypeError(detail.__str__() + "\n" +  msg)

In [7]:
pip install -e .

Obtaining file:///Users/katherinesoderberg/Documents/PythonClass/project_spring_2020
Installing collected packages: kathodes-package
  Found existing installation: kathodes-package 0.0.1
    Can't uninstall 'kathodes-package'. No files were found to uninstall.
  Running setup.py develop for kathodes-package
Successfully installed kathodes-package
Note: you may need to restart the kernel to use updated packages.


In [8]:
import kathodes_package

## Demonstrating the package with example spreadsheet

#### Show filesummary utilities--summarize file and list column names

In [1]:
from kathodes_package import filesummary
filesummary.summarize("restqclist.040220_KS.csv")
column_names = filesummary.listColumns("restqclist.040220_KS.csv")
print(column_names)

The head of the file is:
                         sub rest1.date.seq  rest1.NumOutliers  rest1.TR  \
0  abductor_lothian.06192015        61915.4               10.0       2.0   
1  abductor_lothian.11082017       110817.4               15.0       2.0   
2    addict_bavduin.09262013        92613.4               12.0       2.0   
3    address_humans.04102014        41014.4                5.0       2.0   
4    address_humans.04082016        40816.4               19.0       2.0   

   rest1.nvox1  rest1.nvox2  rest1.nvox3  rest1.numTRs Keep?  rest2.date.seq  \
0        1.875        1.875          3.0         184.0     y         61915.5   
1        1.875        1.875          3.0         184.0     y        112917.4   
2        1.875        1.875          3.0         184.0     y         92613.5   
3        1.875        1.875          3.0         184.0     y         41014.5   
4        1.875        1.875          3.0         184.0     y         40816.5   

   ...  rest7.nvox2  rest7.nvox3  res

#### Show data_cleaning utilities--remove subject scans with missing data and show reason

In [1]:
from kathodes_package import data_cleaning
cleaned_dataframe = data_cleaning.removeMissing('restqclist.040220_KS.csv')


Dropping subject scan affect_bathwater.11062014 because of notMEMPRAGE
Dropping subject scan airway_sidecar.06232015 because of notMEMPRAGE
Dropping subject scan business_wishbone.06072012 because of notMEMPRAGE
Dropping subject scan cattle_fragolli.02292012 because of problemWithTarball
Dropping subject scan fundraiser_spandau.04162015 because of problemWithTarball
Dropping subject scan melwas_stereo.01282016 because of notMEMPRAGE
Dropping subject scan piazza_sedation.05032013 because of notMEMPRAGE
Dropping subject scan pricks_leyden.12202013 because of notMEMPRAGE
There were 8 scans with missing data dropped.


In [2]:
print(cleaned_dataframe)

                            sub rest1.date.seq  rest1.NumOutliers  rest1.TR  \
0     abductor_lothian.06192015        61915.4               10.0       2.0   
1     abductor_lothian.11082017       110817.4               15.0       2.0   
2       addict_bavduin.09262013        92613.4               12.0       2.0   
3       address_humans.04102014        41014.4                5.0       2.0   
4       address_humans.04082016        40816.4               19.0       2.0   
..                          ...            ...                ...       ...   
470  wordlist_deloitte.12142016       121416.4                5.0       2.0   
471     xpress_discman.03012019        30119.4                8.0       2.0   
472    zakopane_strabo.04132013        41313.4                3.0       2.0   
473      zambia_captor.09082011        90811.4                0.0       2.0   
474      zambia_captor.04242013        42413.4               12.0       2.0   

     rest1.nvox1  rest1.nvox2  rest1.nvox3  rest1.n

In [5]:
print(cleaned_dataframe.index.size)

467


#### Show data_cleaning utilities--check for voxel size consistency

In [2]:
data_cleaning.voxelConsistency(cleaned_dataframe,4,1.875)

Subject scan blowpipe_slains.07082011 with row index of 35 does not have voxel size of 1.875
Subject scan pipette_simenon.07202011 with row index of 335 does not have voxel size of 1.875


False

##### Remove rows with inconsistent voxel sizes

In [3]:
#print rows with different voxel sizes
print(cleaned_dataframe.iloc[35])
print(cleaned_dataframe.iloc[335])

sub                  blowpipe_slains.07082011
rest1.date.seq                        70811.1
rest1.NumOutliers                          24
rest1.TR                                 1.47
rest1.nvox1                              3.75
                               ...           
rest8.TR                                1.695
rest8.nvox1                              3.75
rest8.nvox2                              3.75
rest8.nvox3                                 3
rest8.numTRs                              177
Name: 37, Length: 62, dtype: object
sub                  pipette_simenon.07202011
rest1.date.seq                        72011.1
rest1.NumOutliers                           7
rest1.TR                                    2
rest1.nvox1                               2.5
                               ...           
rest8.TR                                  NaN
rest8.nvox1                               NaN
rest8.nvox2                               NaN
rest8.nvox3                               Na

In [2]:
cleaned_consistent_dataframe = cleaned_dataframe.drop(cleaned_dataframe.index[35])
cleaned_consistent_dataframe = cleaned_consistent_dataframe.drop(cleaned_consistent_dataframe.index[334])

##### Check again for voxel size consistency

In [3]:
data_cleaning.voxelConsistency(cleaned_consistent_dataframe,4,1.875)

True

After using voxelConsistency, all voxels for the specified dimension have the same size. This process can be repeated for other dimensions in other scans.

#### Show outlier_assessment utilities

In [4]:
from kathodes_package import outlier_assessment
#
cc_dataframe_noOutliers = outlier_assessment.outlierExclude(cleaned_consistent_dataframe,2,2)

Dropping subject scan bistro_otitis.06292017 due to 67.0 outlying volumes.
Dropping subject scan cheney_diaspora.03282019 due to 49.0 outlying volumes.
Dropping subject scan contra_cutter.03152012 due to 67.0 outlying volumes.
Dropping subject scan density_flares.08052016 due to 50.0 outlying volumes.
Dropping subject scan dotage_messes.06242019 due to 70.0 outlying volumes.
Dropping subject scan findlay_delays.08302018 due to 51.0 outlying volumes.
Dropping subject scan frazer_status.10202011 due to 50.0 outlying volumes.
Dropping subject scan grunge_aleppo.06192014 due to 51.0 outlying volumes.
Dropping subject scan hanging_bletchley.10222018 due to 50.0 outlying volumes.
Dropping subject scan hilaire_thurrock.10172013 due to 139.0 outlying volumes.
Dropping subject scan hillman_miscue.04102014 due to 70.0 outlying volumes.
Dropping subject scan islington_torsades.10172013 due to 98.0 outlying volumes.
Dropping subject scan mitosis_jekyll.03042016 due to 68.0 outlying volumes.
Droppi

In [5]:
print(cc_dataframe_noOutliers)

                            sub rest1.date.seq  rest1.NumOutliers  rest1.TR  \
0     abductor_lothian.06192015        61915.4               10.0       2.0   
1     abductor_lothian.11082017       110817.4               15.0       2.0   
2       addict_bavduin.09262013        92613.4               12.0       2.0   
3       address_humans.04102014        41014.4                5.0       2.0   
4       address_humans.04082016        40816.4               19.0       2.0   
..                          ...            ...                ...       ...   
469  wordlist_deloitte.05302014        53014.4               10.0       2.0   
470  wordlist_deloitte.12142016       121416.4                5.0       2.0   
471     xpress_discman.03012019        30119.4                8.0       2.0   
472    zakopane_strabo.04132013        41313.4                3.0       2.0   
473      zambia_captor.09082011        90811.4                0.0       2.0   

     rest1.nvox1  rest1.nvox2  rest1.nvox3  rest1.n