Submit a notebook that clearly addresses the following, using code and markdown chunks:

1. Describe the data, particularly what an observation is and whether there are any missing data that might impact your analysis. Who collected the data and why? What known limitations are there to analysis? (10/100 pts)
2. Describe the variables you selected to predict mortality and life expectancy, and the rationale behind them. Analyze your variables using describe tables, kernel densities, scatter plots, and conditional kernel densities. Are there any patterns of interest to notice? (10/100 pts)
3. Use your variables to predict mortality with a k-Nearest Neighbor Classifier. Analyze its performance and explain clearly how you select k. (10/100 pts)
4. Use your variables to predict life expectancy with a k-Nearest Neighbor Regressor. Analyze its performance and explain clearly how you select k. (10/100 pts)
5. Describe how your model could be used for health interventions based on patient characteristics. Are there any limitations or risks to consider? (10/100 pts)

Our variables: calories, proteins, carbs, and fats

In [1]:
import pandas as pd 
import numpy as np 
import seaborn as sns
import matplotlib.pyplot as plt

### Question 1

In [5]:
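# Linked mortality file
mdf = pd.read_csv('linked_mortality_file_1999_2000.csv')
print(mdf.head())
# Demographics file
gdf = pd.read_sas("DEMO.xpt", format="xport")
print(gdf.head())
# Inner merge keeps only SEQN values present in both files
df = gdf.merge(mdf, on="SEQN", how="inner")

# Dietary file (source of DRXTKCAL, DRXTPROT, DRXTCARB, DRXTTFAT)
food = pd.read_sas("DRXTOT.xpt", format="xport")
print(food.head())
# Left merge keeps rows without a dietary match; their nutrient columns become NaN
X = df.merge(food, on="SEQN", how="left")
X.head()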

   SEQN  ELIGSTAT  MORTSTAT  UCOD_LEADING  DIABETES  HYPERTEN  PERMTH_INT  \
0     1         2       NaN           NaN       NaN       NaN         NaN   
1     2         1       1.0           6.0       0.0       0.0       177.0   
2     3         2       NaN           NaN       NaN       NaN         NaN   
3     4         2       NaN           NaN       NaN       NaN         NaN   
4     5         1       0.0           NaN       NaN       NaN       244.0   

   PERMTH_EXM  
0         NaN  
1       177.0  
2         NaN  
3         NaN  
4       244.0  
   SEQN  SDDSRVYR  RIDSTATR  RIDEXMON  RIAGENDR  RIDAGEYR  RIDAGEMN  RIDAGEEX  \
0   1.0       1.0       2.0       2.0       2.0       2.0      29.0      31.0   
1   2.0       1.0       2.0       2.0       1.0      77.0     926.0     926.0   
2   3.0       1.0       2.0       1.0       2.0      10.0     125.0     126.0   
3   4.0       1.0       2.0       2.0       1.0       1.0      22.0      23.0   
4   5.0       1.0       2.0       2.

,SEQN,SDDSRVYR,RIDSTATR,RIDEXMON,RIAGENDR,RIDAGEYR,RIDAGEMN,RIDAGEEX,RIDRETH1,RIDRETH2,...,DRQ370QQ,DRD370R,DRQ370RQ,DRD370S,DRQ370SQ,DRD370T,DRQ370TQ,DRD370U,DRQ370UQ,DRD370V
0,1.0,1.0,2.0,2.0,2.0,2.0,29.0,31.0,4.0,2.0,...,,2.0,,2.0,,2.0,,2.0,,2.0
1,2.0,1.0,2.0,2.0,1.0,77.0,926.0,926.0,3.0,1.0,...,,2.0,,2.0,,2.0,,2.0,,2.0
2,3.0,1.0,2.0,1.0,2.0,10.0,125.0,126.0,3.0,1.0,...,,,,,,,,,,
3,4.0,1.0,2.0,2.0,1.0,1.0,22.0,23.0,4.0,2.0,...,,2.0,,2.0,,2.0,,2.0,,2.0
4,5.0,1.0,2.0,2.0,1.0,49.0,597.0,597.0,3.0,1.0,...,,,,,,,,,,
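
The two merges above can be sanity-checked before moving on. The sketch below is only an illustration on small made-up frames, not the real files: `demo`, `mort`, and `diet` are hypothetical stand-ins, and it assumes SEQN should appear at most once per file. `validate` and `indicator` are standard options of `DataFrame.merge`.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the demographics, mortality, and dietary tables (made-up values).
demo = pd.DataFrame({"SEQN": [1, 2, 3, 4], "RIDAGEYR": [2, 77, 10, 49]})
mort = pd.DataFrame({"SEQN": [1, 2, 3, 4], "MORTSTAT": [np.nan, 1.0, np.nan, 0.0]})
diet = pd.DataFrame({"SEQN": [1, 2, 4], "DRXTKCAL": [1358.88, 2463.0, 2658.14]})

# validate="one_to_one" raises a MergeError if SEQN repeats on either side.
merged = demo.merge(mort, on="SEQN", how="inner", validate="one_to_one")

# indicator=True adds a _merge column recording where each row was found.
merged = merged.merge(diet, on="SEQN", how="left",
                      validate="one_to_one", indicator=True)

# Count rows matched in both tables versus rows with no dietary record.
match_counts = merged["_merge"].value_counts()
print(match_counts)
```

Rows labelled `left_only` are the ones whose nutrient columns would be NaN after the left merge, so this count shows how many observations lack dietary data.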


In [6]:
cols = ['ELIGSTAT', 'MORTSTAT', 'PERMTH_INT', 'RIDAGEEX',
        'DRXTKCAL', 'DRXTPROT', 'DRXTCARB', 'DRXTTFAT']
X_interesting = X.loc[:, cols]
X_interesting.head()

   ELIGSTAT  MORTSTAT  PERMTH_INT  RIDAGEEX  DRXTKCAL  DRXTPROT  DRXTCARB  DRXTTFAT
0         2       NaN         NaN      31.0   1358.88     31.96    250.36     27.24
1         1       1.0       177.0     926.0   2463.00    123.16    350.37     71.95
2         2       NaN         NaN     126.0   1517.69     40.19    233.63     49.94
3         2       NaN         NaN      23.0   1474.93     56.16    191.03     56.20
4         1       0.0       244.0     597.0   2658.14     97.13    253.98    114.52
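
Since the first rows already show gaps in MORTSTAT and PERMTH_INT, it may help to quantify the missingness before modeling. The sketch below is a minimal example on a toy frame copied from the five rows printed above, not the full dataset. Dropping rows without an observed MORTSTAT is an assumption about how the modeling sample would be built, not something the notebook has done yet.

```python
import numpy as np
import pandas as pd

# Toy rows copied from the head() output above (not the full dataset).
sample = pd.DataFrame({
    "ELIGSTAT": [2, 1, 2, 2, 1],
    "MORTSTAT": [np.nan, 1.0, np.nan, np.nan, 0.0],
    "PERMTH_INT": [np.nan, 177.0, np.nan, np.nan, 244.0],
    "DRXTKCAL": [1358.88, 2463.0, 1517.69, 1474.93, 2658.14],
})

# Share of missing values in each column.
missing_share = sample.isna().mean()
print(missing_share)

# Number of rows with missing MORTSTAT within each ELIGSTAT value.
missing_by_elig = sample["MORTSTAT"].isna().groupby(sample["ELIGSTAT"]).sum()
print(missing_by_elig)

# Assumption: only rows with an observed outcome are usable for prediction.
eligible = sample.dropna(subset=["MORTSTAT"])
```

On the real data the same three lines would run on `X_interesting` in place of `sample`.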


1. Describe the data, particularly what an observation is and whether there are any missing data that might impact your analysis. Who collected the data and why? What known limitations are there to analysis? (10/100 pts)

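The data combine three files, merged on the respondent identifier SEQN: a demographics file (DEMO.xpt), a dietary file (DRXTOT.xpt), and the 1999-2000 linked mortality file. An observation is a single row, which holds one respondent's values for every attribute. Values are missing in MORTSTAT, PERMTH_INT, DIABETES, and other columns; in the rows printed above, the mortality fields are missing wherever ELIGSTAT is 2. The data were collected by the CDC for use in its National Center for Health Statistics database, 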