# Blood Pressure Analysis and Prediction
## Josh McComack
### Objectives
I wish to investigate the feasibility of predicting systolic and diastolic blood pressure based on the [National Health and Nutrition Examination Survey](https://www.kaggle.com/cdc/national-health-and-nutrition-examination-survey/home) (NHANES) data set from 2013-2014 found on kaggle.com. The primary questions I wish to answer are:
1. Which variables from the survey are most predictive empirically and do they correspond to the what mainstream literature identifies as key factors in blood pressure levels?
2. Comparing scikit-learn’s SGDRegressor, MultiTaskLasso, and RandomForestRegressor, which regression model offers the best predictions on this data set?
3. Does the best model perform well enough to serve as a possible supplementary or alternative way of “measuring” blood pressure?


### Import Libraries and Data
First we will import our libraries and data. The NHANES data is stored in 6 seperate csv files.

In [287]:
# Import libraries
import pandas as pd
import numpy as np


In [288]:
# Load Data
path = "./national-health-and-nutrition-examination-survey/"
demographic_df = pd.read_csv(path + "demographic.csv")
diet_df = pd.read_csv(path + "diet.csv")
exam_df = pd.read_csv(path + "examination.csv")
labs_df = pd.read_csv(path + "labs.csv")
# There seems to be a non UTF-8 Character in the medications data set.
meds_df = pd.read_csv(path + "medications.csv", encoding = "latin1")
questionnaire_df = pd.read_csv(path + "questionnaire.csv")

In [289]:
# Join all data frames into a single data frame
combined_df = demographic_df
for frame in [diet_df, exam_df, labs_df, meds_df, questionnaire_df]:
    combined_df = combined_df.merge(frame, left_on="SEQN", right_on="SEQN", how="outer")
combined_df.head(10)

Unnamed: 0,SEQN,SDDSRVYR,RIDSTATR,RIAGENDR,RIDAGEYR,RIDAGEMN,RIDRETH1,RIDRETH3,RIDEXMON,RIDEXAGM,...,WHD080U,WHD080L,WHD110,WHD120,WHD130,WHD140,WHQ150,WHQ030M,WHQ500,WHQ520
0,73557,8,2,1,69,,4,4,1.0,,...,,40.0,270.0,200.0,69.0,270.0,62.0,,,
1,73557,8,2,1,69,,4,4,1.0,,...,,40.0,270.0,200.0,69.0,270.0,62.0,,,
2,73558,8,2,1,54,,3,3,1.0,,...,,,240.0,250.0,72.0,250.0,25.0,,,
3,73558,8,2,1,54,,3,3,1.0,,...,,,240.0,250.0,72.0,250.0,25.0,,,
4,73558,8,2,1,54,,3,3,1.0,,...,,,240.0,250.0,72.0,250.0,25.0,,,
5,73558,8,2,1,54,,3,3,1.0,,...,,,240.0,250.0,72.0,250.0,25.0,,,
6,73559,8,2,1,72,,3,3,2.0,,...,,,180.0,190.0,70.0,228.0,35.0,,,
7,73559,8,2,1,72,,3,3,2.0,,...,,,180.0,190.0,70.0,228.0,35.0,,,
8,73559,8,2,1,72,,3,3,2.0,,...,,,180.0,190.0,70.0,228.0,35.0,,,
9,73559,8,2,1,72,,3,3,2.0,,...,,,180.0,190.0,70.0,228.0,35.0,,,


### Data Selection
Our data set contains a total 1824 different columns, with overview of the columns found [here](https://www.kaggle.com/cdc/national-health-and-nutrition-examination-survey/home). To get started however, we will focus on features which have been identified as possible factors in blood presssure levels:
* Age [1][2]
* Gender [2]
* Sodium intake [1]
* Smoking [1][2]
* Exercise levels [1][2]
* Alcohol intake [1][2]
* Diabetes [2]
* Weight or Obesity [1][2]

* Family history of high blood pressure [1][2]
* Genetics (Which probably ties in with gender, and family history) [1]
* Stress levels [1][2]
* Chronic kidney Disease [1][2]


References:
* [1] [www.webmd.com](https://www.webmd.com/hypertension-high-blood-pressure/guide/blood-pressure-causes#1)
* [2] [www.heart.org](http://www.heart.org/en/health-topics/high-blood-pressure/why-high-blood-pressure-is-a-silent-killer/know-your-risk-factors-for-high-blood-pressure)


In [290]:
# We will create a list to store all of our selected columns for convenience. 
selected_columns = [];

From the demographic data set, we can use the survey participants' age and gender identified in columns DMDHRAGE and RIAGENDR defined in the detailed column description found [here](https://wwwn.cdc.gov/Nchs/Nhanes/Search/variablelist.aspx?Component=Demographics&CycleBeginYear=2013).

In [291]:
demographic_columns = ['DMDHRAGE', 'RIAGENDR']
selected_columns += demographic_columns
demographic_df[demographic_columns].head()

Unnamed: 0,DMDHRAGE,RIAGENDR
0,69,1
1,54,1
2,72,1
3,33,1
4,78,2


The dietary data set contains potentially usefull information about sodium intake and diet, however the detailed column description does not explain how this information is encoded. So we will use the more quantifiable sodium column LBXSNASI taken from the [laboratory data set](https://wwwn.cdc.gov/Nchs/Nhanes/Search/variablelist.aspx?Component=Laboratory&CycleBeginYear=2013). 

Note: there seems to be many more sodium columns missing this data set.

In [292]:
labs_columns = ['LBXSNASI']
selected_columns += labs_columns
labs_df[labs_columns].head()

Unnamed: 0,LBXSNASI
0,136.0
1,128.0
2,142.0
3,
4,142.0


The [questionnaire](https://wwwn.cdc.gov/Nchs/Nhanes/Search/variablelist.aspx?Component=Questionnaire&CycleBeginYear=2013) contains lots of varius data about smoking, exercise, and alcohol habits. A few columns we will use are:
##### Smoking

* SMQ040: Do you now smoke cigarettes?
* SMQ020: Have you smoked at least 100 cigarettes in your entire life?
* SMD650: During the past 30 days, on the days that you smoked, about how many cigarettes did you smoke per day?
* SMD641: On how many of the past 30 days did you smoke a cigarette?

##### Exercise

* PAQ677: On how many of the past 7 days did you exercise or participate in physical activity for at least 20 minutes that made you sweat and breathe hard, such as basketball, soccer, running, swimming laps, fast bicycling, fast dancing, or similar activities?
* PAQ678: On how many of the past 7 days did you do exercises to strengthen or tone your muscles, such as push-ups, sit-ups, or weight lifting?

##### Alcohol

* ALQ101: In any one year, have you had at least 12 drinks of any type of alcoholic beverage? By a drink, I mean a 12 oz. beer, a 5 oz. glass of wine, or one and half ounces of liquor.
* ALQ130: In the past 12 months, on those days that you drank alcoholic beverages, on the average, how many drinks did you have?

##### Diabetes
* DIQ010: Other than during pregnancy, have you ever been told by a doctor or health professional that you have diabetes or sugar diabetes?

In [293]:
questionnaire_columns = ['SMQ040', 'SMQ020', 'SMD650', 'SMD641', 'PAQ677', 'PAQ678', 'ALQ101', 'ALQ130', 'DIQ010']
selected_columns += questionnaire_columns
questionnaire_df[questionnaire_columns].head()

Unnamed: 0,SMQ040,SMQ020,SMD650,SMD641,PAQ677,PAQ678,ALQ101,ALQ130,DIQ010
0,3.0,1.0,,,,,1.0,1.0,1.0
1,2.0,1.0,1.0,1.0,,,1.0,4.0,1.0
2,3.0,1.0,,,,,1.0,,1.0
3,,,,,,,,,2.0
4,,2.0,,,,,1.0,,2.0


Examining the medications data set, we can see who is taking medication for blood pressure and/or diabetes. 


In [294]:
meds_columns = ['RXDDRUG', 'RXDRSD1']
selected_columns += meds_columns
meds_df[meds_columns].head(11)

Unnamed: 0,RXDDRUG,RXDRSD1
0,99999,
1,INSULIN,Type 2 diabetes mellitus
2,GABAPENTIN,Restless legs syndrome
3,INSULIN GLARGINE,Type 2 diabetes mellitus
4,OLMESARTAN,Type 2 diabetes mellitus with kidney complicat...
5,SIMVASTATIN,Pure hypercholesterolemia
6,INSULIN ASPART,Type 2 diabetes mellitus
7,INSULIN GLARGINE,Type 2 diabetes mellitus
8,PANCRELIPASE,"Disease of pancreas, unspecified"
9,SIMVASTATIN,Pure hypercholesterolemia


The data set doesn't seem to contain direct information on stress levels, chronic kidney disease, or family history of high blood pressure. But from the examination data set we can collect information on body mass index and blood pressure readings:

#####  Body Mass Index

* BMXBMI: Body Mass Index (kg/m**2)	

#####  Body Mass Index
* BPXDI 1-4: Diastolic blood pressure mm Hg (first through fourth readings)
* BPXSY 1-4: Systolic blood pressure mm Hg (first through fourth readings)

In [295]:
exam_columns = ['BMXBMI', 'BPXDI1', 'BPXDI2', 'BPXDI3', 'BPXDI4', 'BPXSY1', 'BPXSY2', 'BPXSY3', 'BPXSY4']
selected_columns += exam_columns
exam_df[exam_columns].head()

Unnamed: 0,BMXBMI,BPXDI1,BPXDI2,BPXDI3,BPXDI4,BPXSY1,BPXSY2,BPXSY3,BPXSY4
0,26.7,72.0,76.0,74.0,,122.0,114.0,102.0,
1,28.6,62.0,80.0,42.0,,156.0,160.0,156.0,
2,28.9,90.0,76.0,80.0,,140.0,140.0,146.0,
3,17.1,38.0,34.0,38.0,,108.0,102.0,104.0,
4,19.7,86.0,88.0,86.0,,136.0,134.0,142.0,


### Data Cleaning
Now that we've identified the columns we wish to use, lets drop them into a single data frame.

In [296]:
selected_df = combined_df[selected_columns]
selected_df.head()

Unnamed: 0,DMDHRAGE,RIAGENDR,LBXSNASI,SMQ040,SMQ020,SMD650,SMD641,PAQ677,PAQ678,ALQ101,...,RXDRSD1,BMXBMI,BPXDI1,BPXDI2,BPXDI3,BPXDI4,BPXSY1,BPXSY2,BPXSY3,BPXSY4
0,69,1,136.0,3.0,1.0,,,,,1.0,...,,26.7,72.0,76.0,74.0,,122.0,114.0,102.0,
1,69,1,136.0,3.0,1.0,,,,,1.0,...,Type 2 diabetes mellitus,26.7,72.0,76.0,74.0,,122.0,114.0,102.0,
2,54,1,128.0,2.0,1.0,1.0,1.0,,,1.0,...,Restless legs syndrome,28.6,62.0,80.0,42.0,,156.0,160.0,156.0,
3,54,1,128.0,2.0,1.0,1.0,1.0,,,1.0,...,Type 2 diabetes mellitus,28.6,62.0,80.0,42.0,,156.0,160.0,156.0,
4,54,1,128.0,2.0,1.0,1.0,1.0,,,1.0,...,Type 2 diabetes mellitus with kidney complicat...,28.6,62.0,80.0,42.0,,156.0,160.0,156.0,


We'll define a couple helper methods for evaluating the contents of each column

In [297]:
def print_uniques(df):
    """Prints each unique value for each column in the data frame 
       with a null count.
       
    Args:
        df: Pandas DataFrame to print unique values from.
        
    """
    for column in df.columns:
        print('{} ({}):{}'.format(column, df[column].dtypes, set(df[column].astype('str'))))
        print_null_count(df[column])
        print()
        
def print_null_count(series):
    """Prints the number of null values in the given series.
    
    Args:
        series: Pandas Series to count null values from.
        
    """
    number_null = np.sum(series.isnull())
    size = len(series)
    percent_null = 0 if size <= 0 else (number_null / size)
    print("{} null count: {} ({:.1f}%)".format(series.name, number_null, percent_null))

#####  Demographic Data

In [298]:
print_uniques(selected_df[demographic_columns])

DMDHRAGE (int64):{'18', '53', '74', '35', '59', '24', '28', '26', '23', '71', '49', '36', '46', '70', '38', '34', '79', '45', '57', '21', '60', '27', '55', '32', '73', '51', '65', '29', '47', '20', '62', '77', '54', '72', '68', '66', '22', '44', '50', '31', '40', '64', '42', '67', '43', '56', '48', '58', '37', '75', '78', '69', '33', '39', '76', '19', '41', '80', '30', '25', '61', '52', '63'}
DMDHRAGE null count: 0 (0.0%)

RIAGENDR (int64):{'2', '1'}
RIAGENDR null count: 0 (0.0%)



Thankfully neither column is missing any values. We will let the age column remain an int, but convert the gender column to a category with human readable definitions rather than "1" or "2". Definition can be found [here](https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/DEMO_H.htm) in the data set description.

In [299]:
selected_df['RIAGENDR'] = selected_df['RIAGENDR'].astype('category')
selected_df['RIAGENDR'] = selected_df['RIAGENDR'].map({1:'male', 2:'female'})

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


#####  Laboratory Data

In [300]:
print_uniques(selected_df[labs_columns])

LBXSNASI (float64):{'148.0', '130.0', '145.0', 'nan', '134.0', '124.0', '132.0', '141.0', '142.0', '147.0', '135.0', '136.0', '144.0', '140.0', '129.0', '131.0', '143.0', '128.0', '146.0', '154.0', '138.0', '137.0', '151.0', '150.0', '119.0', '139.0', '133.0', '127.0'}
LBXSNASI null count: 4677 (0.2%)



Since we have very few missing values, and everyone consumes some amount of sodium, it will probably be a safe bet to fill the missing values with the average from data set. Also, since all the values are really integer values, we will convert the data type.

In [301]:
selected_df['LBXSNASI'].fillna(selected_df['LBXSNASI'].mean(), inplace=True)
selected_df['LBXSNASI'] = selected_df['LBXSNASI'].astype('int')
print_uniques(selected_df[labs_columns])

LBXSNASI (int32):{'134', '143', '147', '144', '140', '127', '128', '132', '138', '119', '139', '151', '137', '135', '130', '129', '141', '124', '136', '133', '150', '154', '146', '131', '142', '148', '145'}
LBXSNASI null count: 0 (0.0%)



A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._update_inplace(new_data)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


#####  Questionnaire Data

In [311]:
print_uniques(selected_df[questionnaire_columns])

SMQ040 (object):{'smokes every day', 'does not smoke', 'smokes some days'}
SMQ040 null count: 0 (0.0%)

SMQ020 (object):{'yes', 'unknown', 'no'}
SMQ020 null count: 0 (0.0%)

SMD650 (category):{'7', '18', '11', '28', '8', '60', '27', '6', '4', '29', '20', '3', '15', '9', '2', '10', '50', '40', 'unknown', '5', '14', '1', '90', '17', '30', '13', '25', '16', '12'}
SMD650 null count: 0 (0.0%)

SMD641 (category):{'7', '18', '24', '28', '26', '23', '8', '21', '27', '6', '4', '29', '20', '3', '22', '15', '2', '10', '0', 'unknown', '5', '14', '1', '19', '17', '30', '16', '25', '12'}
SMD641 null count: 0 (0.0%)

PAQ677 (category):{'7', '2', '0', '6', '4', 'unknown', '5', '1', '3'}
PAQ677 null count: 0 (0.0%)

PAQ678 (category):{'7', '2', '0', '6', '4', 'unknown', '5', '1', '3'}
PAQ678 null count: 0 (0.0%)

ALQ101 (float64):{'2.0', '9.0', 'nan', '1.0'}
ALQ101 null count: 6157 (0.3%)

ALQ130 (float64):{'12.0', 'nan', '25.0', '4.0', '15.0', '2.0', '10.0', '16.0', '24.0', '7.0', '20.0', '14.0', '999

The smoking field definitions can be found [here](https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/SMQ_H.htm#SMQ040). We will categorize these fields with their associated responses. For the missing values, we will recategorize as "2: Some days" if they smoked at all during the last 30 days. If not, we will categorize as "Not smoking".

In [303]:
selected_df['SMQ040'] = selected_df.apply(
    lambda x:
    # Find if they smoked 1 or more cigarettes in the past 30 days
    2 if np.isnan(x['SMQ040']) and x['SMD650'] >= 1 and x['SMD650'] < 200
    # Find if they smoked 1 or more days in the past 30 days
    else 2 if np.isnan(x['SMQ040']) and x['SMD641'] >= 1 and x['SMD641'] <= 30
    # Classify them as not currently smoking if there is no data from 
    # the last 30 days
    else 3 if np.isnan(x['SMQ040'])
    # Otherwise use the original value
    else x['SMQ040'], axis=1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  # This is added back by InteractiveShellApp.init_path()


In [304]:
# Convert to a category with readable definitions
selected_df['SMQ040'] = selected_df['SMQ040'].astype('int').astype('category')
selected_df['SMQ040'] = selected_df['SMQ040'].map({1:'smokes every day', 
                                                   2:'smokes some days',
                                                   3:'does not smoke'})
print_uniques(selected_df[['SMQ040']])

SMQ040 (object):{'smokes every day', 'does not smoke', 'smokes some days'}
SMQ040 null count: 0 (0.0%)



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """


In [305]:
# For "Smoked at least 100 cigarettes in life" (SMQ020) we will fill missing
# values with with the "Don't know" classification and change the data 
# type to category.
selected_df['SMQ020'].fillna(9, inplace=True)
selected_df['SMQ020'] = selected_df['SMQ020'].astype('int').astype('category')
selected_df['SMQ020'] = selected_df['SMQ020'].map({1: 'yes', 2:'no', 9:'unknown'})

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._update_inplace(new_data)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


In [306]:
# For Avg number cigarettes/day during past 30 days (SMD650) we will fill
# missing values with 0 if they don't smoke, and with "Don't know" 
# if they do smoke.
selected_df['SMD650'] = selected_df.apply(
    lambda x:
    0 if np.isnan(x['SMD650']) and x['SMQ040'] == 3
    else 999 if np.isnan(x['SMD650'])
    else x['SMD650'], axis=1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


In [307]:
# Convert to category and classify categories.
# 1 to 90 Range of Values for cigarettes smoked
# 95: 95 cigarettes or more
# 999: Unknown how many smoked
selected_df['SMD650'] = selected_df['SMD650'].astype('int').astype('category')
selected_df['SMD650'] = selected_df['SMD650'].apply(lambda x: 
                                                    'unknown' if x == 999 
                                                    else x)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


In [308]:
# Number of days smoked cigs during past 30 days (SMD641).
# We will fill missing values with 0 if they do not smoke,
# else with 'unknown' if they do smoke but the value is missing.
selected_df['SMD641'] = selected_df.apply(
    lambda x:
    0 if np.isnan(x['SMD641']) and x['SMQ040'] == 3
    else 99 if np.isnan(x['SMD641'])
    else x['SMD641'], axis=1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


In [309]:
# Convert to category.
selected_df['SMD641'] = selected_df['SMD641'].astype('int').astype('category')
selected_df['SMD641'] = selected_df['SMD641'].apply(lambda x: 
                                                    'unknown' if x == 99 
                                                    else x)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until


In [310]:
# PAQ677 and PAQ678 deal with exercise over the past 7 days.
# we will replace any null values with 'unknown'.
selected_df['PAQ677'].fillna(99, inplace=True)
selected_df['PAQ678'].fillna(99, inplace=True)
selected_df['PAQ677'] = selected_df['PAQ677'].astype('int').astype('category')
selected_df['PAQ678'] = selected_df['PAQ678'].astype('int').astype('category')
selected_df['PAQ677'] = selected_df['PAQ677'].apply(lambda x: 
                                                    'unknown' if x == 99 
                                                    else x)
selected_df['PAQ678'] = selected_df['PAQ678'].apply(lambda x: 
                                                    'unknown' if x == 99 
                                                    else x)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._update_inplace(new_data)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  import sys
A value is trying 