**Author**: Justine Debelius<br>
**email**: jdebelius@ucsd.edu<br>
**environment**: agp_2017<br>
**Date**: 27 April 2017

This notebook takes the vioscreen export files, converts them to a csv format that Python can process, and assembles them into a single concatenated per-subject file for downstream data processing.

Raw files were downloaded from the VIOSCREEN website through the GUI portal, and labeled as "vioscreen_report" with a date range.

In [1]:
import os
import re

import numpy as np
import pandas as pd

We'll start by listing all the vioscreen files and then converting them from the downloaded csv format, which is UTF-16 encoded and not pandas compatible, to an ASCII-encoded format that pandas can read.

In [2]:
vioscreen_dir = './01.metadata/vioscreen_reports/'

In [3]:
# Gets the filepaths for unconverted files
fps = [os.path.join(vioscreen_dir, f_)
       for f_ in os.listdir(vioscreen_dir)
       if f_.endswith('.csv') and not f_.endswith('_conv.csv')]

# Converts the files to a useful format
for fp in fps:
    new_fp = fp.replace('.csv', '_conv.csv')
    !iconv -c -f "UTF-16LE" -t "US-ASCII" $fp > $new_fp
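
For environments where `iconv` is not available, the same conversion can be sketched in plain Python. This is a minimal sketch rather than the method used in this notebook: the helper name `convert_utf16_to_ascii` and the demonstration file contents are made up, and it assumes that dropping non-ASCII characters (as `iconv -c` does) is acceptable.

```python
import os
import tempfile


def convert_utf16_to_ascii(src_fp, dst_fp):
    """Re-encodes a UTF-16LE file as ASCII, dropping unencodable characters."""
    # newline='' leaves the original line endings untouched
    with open(src_fp, 'r', encoding='utf-16-le', newline='') as src:
        text = src.read()
    # errors='ignore' drops anything outside ASCII (including a leading
    # byte-order mark), similar to what `iconv -c` does
    with open(dst_fp, 'w', encoding='ascii', errors='ignore', newline='') as dst:
        dst.write(text)


# Demonstration on a throwaway file (hypothetical contents)
with tempfile.TemporaryDirectory() as tmp_dir:
    raw_fp = os.path.join(tmp_dir, 'example.csv')
    conv_fp = os.path.join(tmp_dir, 'example_conv.csv')
    with open(raw_fp, 'wb') as f_:
        f_.write('\ufeffUsername,Protocol\r\ncaf\u00e9-160,demo\r\n'.encode('utf-16-le'))
    convert_utf16_to_ascii(raw_fp, conv_fp)
    with open(conv_fp, 'rb') as f_:
        converted = f_.read()
```

The byte-order mark and the accented character are both silently removed, so any column that relies on non-ASCII text would need a different target encoding.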

The reports are then read into pandas, and concatenated into a single file.

In [4]:
reports = pd.concat([pd.read_csv(os.path.join(vioscreen_dir, f_), sep=',', dtype=str)
                     for f_ in os.listdir(vioscreen_dir) if f_.endswith('_conv.csv')])

We massage the columns slightly. The comma in the protocol name is converted to a `--` so that it cannot be mistaken for a field separator when the data is written to or read from a csv.
We also need to strip the `-160` from the username; this is added by the vioscreen database and does not line up with the survey id stored by the American Gut database.

In [5]:
reports['Protocol'] = reports['Protocol'].replace(
    'Knight Lab, University of Colorado Boulder',
    'Knight Lab -- University of Colorado Boulder')
# .str.replace skips missing usernames instead of raising on them
reports['Username'] = reports['Username'].str.replace('-160', '', regex=False)
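
The cleanup can be checked on a toy frame before running it on the real reports. This is only a sketch: the usernames below are invented, and the anchored pattern `-160$` is an assumption (that the suffix always sits at the end of the username), not something the vioscreen export guarantees.

```python
import pandas as pd

# Hypothetical usernames; the third one contains "-160" twice
toy = pd.DataFrame({
    'Username': ['abc123-160', 'def456-160', '9-1601-160', None],
    'Protocol': ['Knight Lab, University of Colorado Boulder'] * 4,
})

toy['Protocol'] = toy['Protocol'].replace(
    'Knight Lab, University of Colorado Boulder',
    'Knight Lab -- University of Colorado Boulder')

# Anchoring the pattern removes only a trailing "-160"
toy['Username'] = toy['Username'].str.replace(r'-160$', '', regex=True)
```

An unanchored replacement would turn `9-1601-160` into `91`, while the anchored version keeps `9-1601`; which behavior is right depends on how survey ids are actually formed.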

We'll drop columns we won't use. These columns are either maintained for vioscreen's own purposes and not used here, or contain no information.

In [6]:
drop_columns = ['RECNO',
                'TIME',
                'SRVID',
                'NutrientRecommendation',
                'Gender',
                'Age',
                'Height',
                'Weight',
                'BMI',
                'EER',
                'ActivityLevel',
                'Visit',
                'SubjectId',
                'UserId',
                'DOB',
                'Email',
                'scf',
                'scfv',
                'VitaminC',
                'VitaminCFreq',
                'VitaminCDose',
                'VitaminCAvg',
                'FishOil',
                'FishOilFreq',
                'FishOilDose',
                'FishOilAvg']
reports.drop(labels=drop_columns, axis='columns', inplace=True)
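
`DataFrame.drop` raises a `KeyError` if any listed column is absent, which can happen if an export omits a field. A hedged sketch of a more tolerant variant is shown here on a toy frame; the `Calories` column and the shortened drop list are illustrative only.

```python
import pandas as pd

# Toy frame with one column from the drop list and one hypothetical nutrient
toy = pd.DataFrame({'Username': ['abc123'], 'RECNO': ['1'], 'Calories': ['2000']})
toy_drop_columns = ['RECNO', 'TIME']

# Record which expected columns are absent, then drop whatever is present
missing = [c for c in toy_drop_columns if c not in toy.columns]
toy = toy.drop(columns=toy_drop_columns, errors='ignore')
```

Keeping the strict default is arguably safer when the export format is stable, since a missing column then signals a change in the report layout.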

Next, we'll convert most of the remaining column names to snake case, which is the standard used for other projects, and prefix them with `vioscreen`.

To do this, we'll write a quick function to convert to snake case.

In [7]:
def convert_to_snake(camel):
    """Converts from CamelCase to snake_case
    
    Based on the answer provided in the Stack Overflow question:
    http://stackoverflow.com/questions/1175208/
    
    Parameters
    ----------
    camel : str
        The CamelCase string to convert
        
    Returns
    -------
    str
        a snake_case string
        
    
    """
    s1 = re.sub('(.)([A-Z][a-z]+)', r'\1_\2', camel)
    return re.sub('([a-z0-9])([A-Z])', r'\1_\2', s1).lower()

In [8]:
new_vioscreen_columns = {c: 'vioscreen_%s' % convert_to_snake(c)
                         for c in reports.columns
                         if c not in {'Username'}
                         }

In [9]:
reports.rename(columns=new_vioscreen_columns, inplace=True)
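
To see what the renaming produces, the conversion can be tried on a few names. The sketch below repeats the two-pass regex from `convert_to_snake` so that it runs on its own; the example names are borrowed from the drop list purely for illustration, and the toy frame is hypothetical.

```python
import re

import pandas as pd


def convert_to_snake(camel):
    # Same two-pass regex as the function above, repeated so this runs alone
    s1 = re.sub('(.)([A-Z][a-z]+)', r'\1_\2', camel)
    return re.sub('([a-z0-9])([A-Z])', r'\1_\2', s1).lower()


examples = ['VitaminCFreq', 'FishOilDose', 'SubjectId', 'BMI']
converted = {c: 'vioscreen_%s' % convert_to_snake(c) for c in examples}

# Rename every column except the username on an empty toy frame
toy = pd.DataFrame(columns=['Username', 'SubjectId', 'BMI'])
toy = toy.rename(columns={c: 'vioscreen_%s' % convert_to_snake(c)
                          for c in toy.columns if c != 'Username'})
```

Here `VitaminCFreq` becomes `vioscreen_vitamin_c_freq`, while an all-caps name such as `BMI` is treated as a single word and becomes `vioscreen_bmi`.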

Finally, we'll set the username as the index, and then save the file. Missing values will be coded as "Unspecified", which is the standard for American Gut.

In [10]:
reports.set_index('Username', inplace=True)

In [11]:
reports.to_csv('./01.metadata/vioscreen_nutrient_report.txt', sep='\t', index_label='survey_id', na_rep="Unspecified")
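
The effect of `index_label` and `na_rep` can be illustrated with an in-memory round trip. This is a sketch on invented data: the `vioscreen_calories` column and the survey ids are placeholders, not values from the real report.

```python
import io

import numpy as np
import pandas as pd

# Toy frame indexed by username, with one missing value
toy = pd.DataFrame({'vioscreen_calories': ['2000', np.nan]},
                   index=pd.Index(['abc123', 'def456'], name='Username'))

# Write with the same options as the notebook, but to a string buffer
buffer = io.StringIO()
toy.to_csv(buffer, sep='\t', index_label='survey_id', na_rep='Unspecified')
lines = buffer.getvalue().splitlines()

# Reading back as strings keeps "Unspecified" as a literal value
round_trip = pd.read_csv(io.StringIO(buffer.getvalue()), sep='\t',
                         dtype=str, index_col='survey_id')
```

Because "Unspecified" is not one of the default missing-value markers in pandas, it survives as text when read back; passing `na_values=['Unspecified']` to `read_csv` would restore it to a missing value if that is preferred.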

The report file can now be combined with the metadata, using the survey_id.
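
One way that combination might look is a left merge on `survey_id`. The following is a sketch under stated assumptions: the metadata layout (a `sample_name` column alongside `survey_id`), the ids, and the nutrient column are all hypothetical, and the real metadata may be organized differently.

```python
import pandas as pd

# Hypothetical per-sample metadata; two samples share one survey
metadata = pd.DataFrame({
    'sample_name': ['s1', 's2', 's3'],
    'survey_id': ['abc123', 'abc123', 'zzz999'],
})

# Hypothetical slice of the nutrient report
nutrients = pd.DataFrame({
    'survey_id': ['abc123', 'def456'],
    'vioscreen_calories': ['2000', '1800'],
})

# Keep every sample; samples without a report get "Unspecified"
combined = metadata.merge(nutrients, on='survey_id', how='left')
combined = combined.fillna('Unspecified')
```

A left merge keeps samples that have no vioscreen report, and filling the gaps with "Unspecified" matches the missing-value convention used when the report was saved.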