**Author**: Justine Debelius<br>
**email**: jdebelius@ucsd.edu<br>
**environment**: agp_2017<br>
**Date**: 25 May 2017

The goal of this notebook is to provide demographic summaries for participants in the American Gut and associated projects. We summarize the information available in the sample metadata.

The information generated here will be used for Table 1 of the American Gut paper.

We'll start by importing the necessary libraries.

In [1]:
from functools import partial

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import biom

We're going to load the mapping file downloaded from Qiita and merged with all samples.

In [2]:
map_ = pd.read_csv('../01.justine_packaging/01.metadata/ag_full_map.txt', sep='\t', dtype=str)
sotu = biom.load_table('../01.justine_packaging/02.raw_tables/otu_table_no_blooms_125nt_with_tax_min1250.biom')

We'll then select a single instance of a sample, since some samples have been sequenced multiple times.

In [3]:
single_rep = map_.copy().drop_duplicates('original_sample_name')
single_rep.set_index('original_sample_name', inplace=True)
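
As a minimal sketch of what the deduplication step does, the toy table below (hypothetical sample names, not real American Gut identifiers) has one sample that was sequenced twice. `drop_duplicates` keeps the first row it sees for each `original_sample_name`.

```python
import pandas as pd

# Hypothetical toy mapping file: sample 'a' was sequenced twice.
toy_map = pd.DataFrame({
    'sample_name': ['10317.a.1', '10317.a.2', '10317.b.1'],
    'original_sample_name': ['a', 'a', 'b'],
})

# Keep the first row for each original sample, then index by it.
toy_single = toy_map.drop_duplicates('original_sample_name')
toy_single = toy_single.set_index('original_sample_name')

print(toy_single)
```

Which replicate survives depends on row order in the mapping file, so this is only appropriate when the replicates are interchangeable for the summary being computed.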

We'll add a column describing the sequencing depth for samples that survived rarefaction at 1250 sequences/sample.

In [4]:
single_rep = single_rep.join(pd.Series(sotu.sum('sample'), index=sotu.ids('sample'), name='seq_depth'))

We'll start by checking the number of participants and the number of blanks.

In [5]:
blanks = single_rep.index[single_rep['body_habitat'] == 'not applicable']
print('There are %i blanks' % len(blanks))
humans = single_rep.drop(blanks)
print('There are %i participants' % len(humans['host_subject_id'].value_counts()))
print('There are %i participants with at least 1 sample with 1250 sequences/sample.'
      % np.sum(humans.groupby('host_subject_id').count()['seq_depth'] > 0))

There are 1906 blanks
There are 11336 participants
There are 10498 participants with at least 1 sample with 1250 sequences/sample.
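
The participant counts above rely on `count()` skipping missing values: a sample that did not reach 1250 sequences has no `seq_depth`, so it does not contribute to its participant's count. A small sketch on hypothetical data shows the idea.

```python
import numpy as np
import pandas as pd

# Hypothetical toy metadata: p2 has no sample that passed the depth filter.
toy = pd.DataFrame({
    'host_subject_id': ['p1', 'p1', 'p2', 'p3'],
    'seq_depth': [1500.0, np.nan, np.nan, 2000.0],
})

# count() ignores NaN, so this is the number of deep-enough samples per person.
deep_samples = toy.groupby('host_subject_id')['seq_depth'].count()

n_participants = toy['host_subject_id'].nunique()
n_with_deep_sample = int((deep_samples > 0).sum())

print(n_participants, n_with_deep_sample)
```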


We'll do the same calculation for the American Gut Project.

We'll use the body_habitat field to identify where each sample was collected. If the value of body_habitat is 'not applicable', we'll assume the sample is a blank. We'll use a helper function to rename the values in body_habitat so they look clean.

In [6]:
def habitat_clean(x):
    if x in {'not applicable'}:
        return 'Blank'
    else:
        return x.split(' ')[0].replace('UBERON:', '').title()

single_rep['body_habitat'] = single_rep['body_habitat'].apply(habitat_clean)
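
To illustrate the cleaning step, the sketch below applies the same helper to a few raw values. The `UBERON:` strings are assumed examples of the metadata format, not values copied from the mapping file.

```python
def habitat_clean(x):
    if x in {'not applicable'}:
        return 'Blank'
    else:
        # Keep the first word, drop the ontology prefix, and title-case it.
        return x.split(' ')[0].replace('UBERON:', '').title()

# Assumed raw values for illustration only.
examples = ['UBERON:feces', 'UBERON:oral cavity', 'not applicable']
cleaned = [habitat_clean(value) for value in examples]

print(cleaned)
```

The helper expects strings, so missing values would need to be handled before it is applied.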

We'll count the number of times a person has a type of sample and then use that to build a matrix comparing the counts.

In [7]:
site_counts = pd.DataFrame(single_rep.groupby(['body_habitat', 'host_subject_id']).count()[['barcode', 'seq_depth']])
site_counts.reset_index(inplace=True)

In [8]:
site_counts.groupby('body_habitat').sum()['seq_depth']

body_habitat
Blank       217
Ear          36
Eye          40
Feces     11689
Hair         15
Nose        179
Oral        984
Skin        779
Vagina       31
Name: seq_depth, dtype: int64

In [9]:
ag_participants = pd.DataFrame({'AG Participants': site_counts['body_habitat'].value_counts(),
                                'AG Samples': site_counts.groupby('body_habitat').sum()['barcode'],
                                'Samples over 1250': site_counts.groupby('body_habitat').sum()['seq_depth']}
                                )
ag_participants.loc['Blank', ['AG Participants']] = np.nan
ag_participants.sort_values('AG Participants', inplace=True, ascending=False)
ag_participants

        AG Participants  AG Samples  Samples over 1250
Feces           11005.0       12829              11689
Oral              849.0        1036                984
Skin              359.0         900                779
Nose               80.0         192                179
Eye                37.0          43                 40
Ear                34.0          40                 36
Vagina             31.0          39                 31
Hair               15.0          18                 15
Blank               NaN        1906                217


In [10]:
ag_participants.iloc[:-1].sum()

AG Participants      12410.0
AG Samples           15097.0
Samples over 1250    13753.0
dtype: float64

Next, we'll look at the geographic distribution of participants.

In [11]:
humans['dummy'] = 1
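humans.replace('Unspecified', np.nan, inplace=True)
# NOTE: the inner values below are set literals, not {'not applicable': np.nan}
# mappings, so 'not applicable' is not replaced here and still shows up as a
# country in the output below.
humans.replace({'country': {'not applicable', np.nan},
                'state': {'not applicable', np.nan}}, inplace=True)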
countries = humans.groupby('country').count()[['dummy', 'seq_depth']].sort_values('dummy', ascending=False)

print('There are participants from at least %i countries.' % len(countries))
print('%i participants did not supply a country.\n' % np.sum(pd.isnull(humans['country'])))
print(countries)

There are participants from at least 46 countries.
16 participants did not supply a country.

                                      dummy  seq_depth
country                                               
USA                                   10798       9765
United Kingdom                         2903       2692
Australia                               349        329
Canada                                  320        296
Belgium                                 103         95
Ireland                                  84         82
Switzerland                              74         69
Morocco                                  70         61
Germany                                  63         56
France                                   45         43
Sweden                                   34         31
Netherlands                              27         22
Norway                                   25         23
not applicable                           23         20
Italy                     

Next, we'll compare the American, British, and Canadian populations based on age, BMI, sex, last antibiotic use, and racial makeup.

In [13]:
humans[['age_corrected', 'bmi_corrected']] = humans[['age_corrected', 'bmi_corrected']].astype(float)

In [14]:
nationalism = humans.loc[humans['country'].apply(lambda x: x in {'USA', 'United Kingdom'})].copy()
nationalism.loc[nationalism['age_corrected'] > 102] = np.nan

Finally, we're going to compare Americans to available summary statistics.
We're going to look at sex, race, smoking, diabetes and inflammatory bowel disease (IBD) diagnosis, and body mass index (BMI). To do this, we're going to reformat some of the survey responses.

In [13]:
def mapper(mapping, value):
    return mapping.get(value, value)

diabetes_values_fix = {'I do not have this condition': 'I do not have diabetes',
                       'Diagnosed by a medical professional (doctor, physician assistant)': 'I have diabetes',
                       "Diagnosed by an alternative medicine practitioner": "I have diabetes",
                       'Type I': 'I have diabetes',
                       'Type II': 'I have diabetes',
                       'Self-diagnosed': 'I have diabetes'}

ibd_values_fix = {"Crohn's disease": "I have IBD",
                  "Diagnosed by a medical professional (doctor, physician assistant)": "I have IBD",
                  "Diagnosed by an alternative medicine practitioner": "I have IBD",
                  "I do not have this condition": "I do not have IBD",
                  "I do not have IBD": "I do not have IBD",
                  "Ulcerative colitis": "I have IBD",
                  "Self-diagnosed": "I have IBD"}

smoking_values_fix = {'Daily': 'I smoke',
                      'Never': 'I do not smoke',
                      'Occasionally (1-2 times/week)': 'I smoke',
                      'Rarely (a few times/month)': 'I smoke',
                      'Rarely (few times/month)': 'I smoke',
                      'Regularly (3-5 times/week)': 'I smoke'}

education_values_fix = {'Did not complete high school': 'Did not complete high school',
                        'High School or GED equilivant': 'High School or GED equilivant',
                        'Some college or technical school': 'Some college or technical school',
                        "Associate's degree": "Associate's degree",
                        "Bachelor's degree": "Bachelor's degree",
                        "Some graduate school or professional": "Bachelor's degree",
                        "Graduate or Professional degree": "Graduate or Professional degree"
                        }

diabetes_map = partial(mapper, diabetes_values_fix)
ibd_map = partial(mapper, ibd_values_fix)
smoke_map = partial(mapper, smoking_values_fix)
education_map = partial(mapper, education_values_fix)

nationalism['diabetes'] = nationalism['diabetes'].apply(diabetes_map)
nationalism['ibd'] = nationalism['ibd'].apply(ibd_map)
nationalism['smoking_frequency'] = nationalism['smoking_frequency'].apply(smoke_map)
nationalism['level_of_education'] = nationalism['level_of_education'].apply(education_map)
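
The recoding above works by binding each lookup table to `mapper` with `functools.partial`. A minimal sketch with a shortened, illustrative dictionary shows that values missing from the table pass through unchanged.

```python
from functools import partial

def mapper(mapping, value):
    # Return the recoded value, or the original value if it is not in the table.
    return mapping.get(value, value)

# Shortened version of the smoking table, for illustration only.
smoking_values_fix = {'Daily': 'I smoke',
                      'Never': 'I do not smoke'}
smoke_map = partial(mapper, smoking_values_fix)

recoded = [smoke_map(value) for value in ['Daily', 'Never', 'Unspecified']]
print(recoded)
```

Because unmatched values are returned as-is, a misspelled key in a lookup table fails silently: the original response simply survives as its own category.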

We exclude BMI categorization for anyone under the age of 18. According to the World Health Organization (WHO), BMI for children under 18 must be calculated based on their age and gender. We also drop the level of education for anyone under the age of 25.

In [14]:
nationalism.loc[nationalism['age_corrected'] < 18, 'bmi_cat'] = np.nan
nationalism.loc[nationalism['age_corrected'] < 25, 'level_of_education'] = np.nan

In [15]:
def age_mod(x):
    if pd.isnull(x):
        return x
    elif x < 5:
        return 'Less than 5'
    elif x < 10:
        return '5 - 10'
    elif x < 20:
        return '11 - 20'
    elif x < 30:
        return '21 - 30'
    elif x < 40:
        return '31 - 40'
    elif x < 50:
        return '41 - 50'
    elif x < 60:
        return '51 - 60'
    elif x < 70:
        return '61 - 70'
    elif x < 80:
        return '71 - 80'
    else:
        return 'Older than 80'

In [16]:
nationalism['age_mod'] = nationalism['age_corrected'].apply(age_mod)
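
The bins in `age_mod` are half-open: each label covers ages from its lower edge up to, but not including, the next edge, so an age of exactly 10 falls under '11 - 20'. The sketch below reproduces the same boundaries with `bisect` (an alternative formulation for checking the edges, not the notebook's method) and assumes non-null ages.

```python
from bisect import bisect_right

# Upper edges used by the if/elif chain, in order.
edges = [5, 10, 20, 30, 40, 50, 60, 70, 80]
labels = ['Less than 5', '5 - 10', '11 - 20', '21 - 30', '31 - 40',
          '41 - 50', '51 - 60', '61 - 70', '71 - 80', 'Older than 80']

def age_bin(age):
    # bisect_right counts the edges <= age, which is the index of the bin.
    return labels[bisect_right(edges, age)]

for age in [4.9, 5, 10, 79.9, 80]:
    print(age, '->', age_bin(age))
```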

In [17]:
americans = nationalism.loc[nationalism['country'] == 'USA'].copy()

In [18]:
res_table = {}
n_samples = float(len(nationalism))
cats = ['country', 'sex', 'race', 'smoking_frequency', 'diabetes', 'ibd', 'bmi_cat', 'age_mod', 'level_of_education']

for cat in cats[1:]:
    # drop out any null values
    cat_tab = nationalism[cat].dropna()
    
    cat_counts = pd.DataFrame([cat_tab.value_counts(), cat_tab.value_counts(normalize=True) * 100],
                               index=['counts', 'percentage']).to_dict()

    for group, fracs in cat_counts.items():
        res_table[(cat.upper(), group)] = fracs

res = pd.DataFrame.from_dict(res_table, orient='index')
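
For a single category, the loop above boils down to counting the non-null responses and expressing each count as a percentage of those responses. A minimal sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical responses; the missing value is excluded from the denominator.
toy = pd.Series(['female', 'male', 'female', None, 'female'], name='sex')
values = toy.dropna()

summary = pd.DataFrame({
    'counts': values.value_counts(),
    'percentage': values.value_counts(normalize=True) * 100,
})

print(summary)
```

Note that the percentages are relative to the people who answered each question, not to the full cohort, so the denominator differs from category to category.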

We can compare our summary results to data from the US Census.

In [23]:
# Category/value : percent in US population 
census_data = {
               # sex (2010 census)
               #https://www.census.gov/prod/cen2010/briefs/c2010br-03.pdf
               ('SEX', 'female'): 50.1,
               ('SEX', 'male'): 49.1,  # this is an over estimate as only the % of females is described in the above URL
               ('SEX', 'other'): 0.03,  # does not appear to be tracked
    
               # Participant ages (2010 census)
               # https://www.census.gov/prod/cen2010/briefs/c2010br-03.pdf
               ('AGE_MOD', 'Less than 5'): (20201362 / 308745538) * 100,
               ('AGE_MOD', '5 - 10'): (20348657 / 308745538) * 100,
               ('AGE_MOD', '11 - 20'): ((20677194 + 22040343) / 308745538) * 100,
               ('AGE_MOD', '21 - 30'): ((21585999 + 21101849) / 308745538) * 100,
               ('AGE_MOD', '31 - 40'): ((19962099 + 20179642) / 308745538) * 100,
               ('AGE_MOD', '41 - 50'): ((20890964 + 22708591) / 308745538) * 100,
               ('AGE_MOD', '51 - 60'): ((22298125 + 19664805) / 308745538) * 100,
               ('AGE_MOD', '61 - 70'): ((16817924 + 12435263) / 308745538) * 100,
               ('AGE_MOD', '71 - 80'): ((9278166 + 7317795) / 308745538) * 100,
               ('AGE_MOD', 'Older than 80'): ((5743327 + 3620459 + 1448366 + 371244 + 53364) / 308745538) * 100,               
               
               # Participant race (2010 census)
               # from http://www.census.gov/prod/cen2010/briefs/c2010br-02.pdf
               # doesn't sum to 100% as the fields don't map exactly, so there may be some overlap represented below
               ('RACE', 'African American'): 12.6,
               ('RACE', 'Asian or Pacific Islander'): 5.0,
               ('RACE', 'Caucasian'): 63.7,
               ('RACE', 'Hispanic'): 16.3,
               ('RACE', 'Other'): 6.2,
    
               # Education (2015 census bureau)
               # from https://www.census.gov/content/dam/Census/library/publications/2016/demo/p20-578.pdf
               ('LEVEL_OF_EDUCATION', 'Did not complete high school'): 11.6,
               ('LEVEL_OF_EDUCATION', 'High School or GED equilivant'): 29.6,
               ('LEVEL_OF_EDUCATION', 'Some college or technical school'): 16.6,
               ('LEVEL_OF_EDUCATION', "Associate's degree"): 9.8,
               ('LEVEL_OF_EDUCATION', "Bachelor's degree"): 20.5,
               ('LEVEL_OF_EDUCATION', 'Graduate or Professional degree'): 12.0,
 
               # BMI
               # we probably want to filter to > 20yo for these values in the metadata
               # https://www.cdc.gov/nchs/data/hus/2015/058.pdf
               ('BMI_CAT', 'Normal'): 28.9,
               ('BMI_CAT', 'Overweight'): 69.5 - 36.4,
               ('BMI_CAT', 'Obese'): 36.4,
               ('BMI_CAT', 'Underweight'): (100 - 28.9 - 69.5),

               # from PMID 27144261
               ('DIABETES', 'I do not have diabetes'): 90.7,
               ('DIABETES', 'I have diabetes'): 9.3, # This uses 21 million 

               # from http://www.cdc.gov/ibd/ibd-epidemiology.htm
               ('IBD', 'I do not have IBD'): 99.6,
               ('IBD', 'I have IBD'): 0.4,
          
               # from https://www.cdc.gov/mmwr/volumes/65/wr/mm6544a2.htm?s_cid=mm6544a2_w
               ('SMOKING_FREQUENCY', 'I do not smoke'): 84.9,
               ('SMOKING_FREQUENCY', 'I smoke'): 15.1,
}

res['US Census/CDC/NHANES data percentages'] = pd.DataFrame.from_dict(census_data, orient='index')
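
Each age entry in `census_data` is a share of the total 2010 population. As a worked example, the '11 - 20' entry sums two population counts and divides by the total; the variable names below are illustrative, and the sketch assumes the two addends are consecutive five-year census age groups.

```python
# Values copied from the census_data dictionary above.
total_population = 308745538
first_group = 20677194    # assumed: first five-year group in the bin
second_group = 22040343   # assumed: second five-year group in the bin

pct_11_to_20 = (first_group + second_group) / total_population * 100
print(round(pct_11_to_20, 6))
```

The expression relies on true division, so under Python 2 it would need floats or `from __future__ import division`.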

In [24]:
res

                       counts  percentage  US Census/CDC/NHANES data percentages
AGE_MOD 11 - 20         427.0    3.250856                              13.835839
AGE_MOD 21 - 30        1510.0   11.496003                              13.826223
AGE_MOD 31 - 40        2707.0   20.609060                              13.001561
AGE_MOD 41 - 50        2397.0   18.248953                              14.121517
AGE_MOD 5 - 10          246.0    1.872859                               6.590753
AGE_MOD 51 - 60        2571.0   19.573658                              13.591429
AGE_MOD 61 - 70        2247.0   17.106966                               9.474853
AGE_MOD 71 - 80         704.0    5.359726                               5.375288
AGE_MOD Less than 5     239.0    1.819566                               6.543046
AGE_MOD Older than 80    87.0    0.662352                               3.639489
